\begin{figure*}[h]
\centering
\includegraphics[width=\textwidth]{"DLTPy flow".pdf}
\caption{Overview of the process of training from annotated source code.}
\label{figure:pipeline}
\end{figure*}

\section{Method} \label{method}
\dltpy{} has two main phases: a training phase and a prediction phase. In this section, we first describe the steps involved in the training process, and then discuss how prediction works given the trained model. The training process consists of multiple steps. First, we extract relevant training data from Python projects (section \ref{method:extract}). Next, we preprocess the training data, for instance by lemmatizing its textual parts (section \ref{method:preprocess}). The preprocessed training data is then filtered so that only relevant functions are selected (section \ref{method:selection}). Then, we generate input vectors using word embeddings and one-hot encoding (section \ref{method:vector}). Finally, we train an RNN (section \ref{method:lstm}). After the training process has completed, the trained RNN can be used to predict function types (section \ref{method:prediction}).

\subsection{Collecting Data from ASTs} \label{method:extract}
For each Python project in our data set, we want to extract the relevant parts of its functions. Every Python file is parsed into an abstract syntax tree (AST). From this AST, we collect the functions in the file, both inside and outside classes. For each function, we extract the following elements:
\begin{itemize}
    \item $n_f$: The name of the function
    \item $d_f$: The docstring of the function
    \item $c_f$: The comment of the function
    \item $n_p$: A list of the names of the function parameters
    \item $t_p$: A list of the types of the function parameters
    \item $c_p$: A list of the comments describing function parameters
    \item $e_r$: A list of the return expressions of the function
    \item $t_r$: The return type of the function
    \item $c_r$: The comment describing the return value
\end{itemize}

Together, these elements form the tuple $(n_f, d_f, c_f, n_p, t_p, c_p, e_r, t_r, c_r)$. Figure \ref{figure:pipeline}a shows a code sample; this sample is parsed and the information for the tuple is extracted as shown in Figure \ref{figure:pipeline}b. This tuple is similar to the input data used in NL2Type \cite{Malik2019NL2Type:Information}, except for $d_f$ and $e_r$.

$d_f$ is the docstring of the Python function. This docstring often contains a few lines of text describing how the function works, and sometimes also contains information about the parameters or the return value. In some cases, a more structured format is used, such as the ReST, Google, or NumPy style. These formats describe the parameters and the return value separately from the function description. In those cases, we can extract this information for $c_f$, $c_p$, and $c_r$. We extract these comments only if the docstring is in one of the structured formats mentioned above.

$e_r$ is a list of the return expressions of the function. After the preprocessing step (section \ref{method:preprocess}), it contains all the identifiers and keywords used in the return expressions. The intuition is that functions often return variables and that the names of these variables may convey useful information about the return type.
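
As an illustration of this step, the sketch below collects a simplified version of the tuple using Python's built-in \texttt{ast} module (Python 3.9+ for \texttt{ast.unparse}). It is not the exact \dltpy{} implementation: the structured docstring comments $c_f$, $c_p$, and $c_r$ would require an additional docstring parser and are omitted, and the helper name \texttt{extract\_functions} is our own.

\begin{verbatim}
# Sketch: extracting (n_f, d_f, n_p, t_p, e_r, t_r) with the standard ast module.
import ast

def extract_functions(source: str):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args.args
        yield {
            "n_f": node.name,                                # function name
            "d_f": ast.get_docstring(node),                  # docstring (may be None)
            "n_p": [a.arg for a in args],                    # parameter names
            "t_p": [ast.unparse(a.annotation) if a.annotation else None
                    for a in args],                          # parameter type annotations
            "e_r": [ast.unparse(r.value)                     # return expressions
                    for r in ast.walk(node)                  # (nested defs not excluded)
                    if isinstance(r, ast.Return) and r.value is not None],
            "t_r": ast.unparse(node.returns) if node.returns else None,
        }

code = 'def add(x: int, y: int) -> int:\n    """Add two numbers."""\n    return x + y\n'
print(list(extract_functions(code)))
\end{verbatim}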

\subsection{Preprocessing} \label{method:preprocess}
The information in the tuple is still raw natural language text. To capture only the relevant parts of the text, we first preprocess the elements in the tuple. The preprocessing pipeline consists of four steps and is based on the preprocessing stage in \cite{Malik2019NL2Type:Information}:

\begin{enumerate}
    \item \textbf{Remove punctuation, line breaks, and digits} We remove all non-alphabetical characters, and line breaks are removed to create a single piece of text. A full stop that is not at the end of a sentence is replaced with a space, so that, for instance, an object field or method access is not treated as a sentence separator (for example, \texttt{object.property} becomes \texttt{object property}).
    \item \textbf{Tokenize} We tokenize sentences using spaces as separators. Before tokenization, snake case and camel case identifiers are converted into space-separated sequences of words.
    \item \textbf{Lemmatize} We convert all inflected words to their lemma. For example, ``removing'' and ``removed'' become ``remove''.
    \item \textbf{Remove stop words} We remove stop words (such as ``was'', ``be'', ``and'', ``while'', and ``the''\footnote{See https://gist.github.com/sebleier/554280 for a full list of stop words.}) from the sentences, because these words are often less relevant and removing them gives more weight to the remaining words. This step is not applied to identifiers (function names, parameter names, and return expressions), because in the short sentences these identifiers form, stop words carry more relevance.
\end{enumerate}

An example of a preprocessed tuple is shown in Figure \ref{figure:pipeline}c.
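
As an illustration of these four steps, the following sketch applies them to a single comment or identifier. It assumes NLTK's WordNet lemmatizer and English stop-word list, and uses simplified regular expressions of our own; it is not necessarily the exact pipeline used in \dltpy{}.

\begin{verbatim}
# Sketch of the four preprocessing steps, assuming NLTK
# (pip install nltk; nltk.download("wordnet"); nltk.download("stopwords")).
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text, is_identifier=False):
    # 1. Remove punctuation, line breaks and digits; a dot becomes a space so
    #    that e.g. "object.property" is not treated as a sentence boundary.
    text = re.sub(r"[^a-zA-Z_.\s]", " ", text).replace(".", " ").replace("\n", " ")
    # 2. Tokenize: split snake_case and camelCase identifiers into words.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text.replace("_", " "))
    tokens = text.lower().split()
    # 3. Lemmatize each token (as a verb, a simplification): "removing" -> "remove".
    tokens = [LEMMATIZER.lemmatize(t, pos="v") for t in tokens]
    # 4. Remove stop words, but only for comments, not for identifiers.
    if not is_identifier:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("Removes the given item from the list."))    # comment
print(preprocess("remove_itemFromList", is_identifier=True))  # identifier
\end{verbatim}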

\subsection{Function Selection} \label{method:selection}
After collecting and preprocessing the function tuples, we select the relevant functions. We filter the set of functions on a few criteria.

First, a function must have at least one type in $t_p$ or it must have $t_r$; otherwise, it cannot serve as training data. A function must also have at least one return expression in $e_r$, since we do not want to predict the return type of a function that does not return anything.

Furthermore, for functions where $n_p$ contains the parameter \texttt{self}, we remove this parameter from $n_p$, $t_p$, and $c_p$, since this parameter has the specific role of accessing the instance of the class in which the method is defined. Therefore, the name of this parameter does not reflect any information about its type and is thus not relevant.

Finally, we do not predict the types \texttt{None} (which can be determined statically) and \texttt{Any} (which is always correct). Thus, we do not consider a parameter for prediction if its type is \texttt{Any}, and we do not consider a return type for prediction if it is \texttt{Any} or \texttt{None}.
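
As a sketch of these criteria, the filter below operates on the simplified tuples produced by the extraction sketch in section \ref{method:extract}; the exact filtering code in \dltpy{} may differ, and the dictionary keys are our own.

\begin{verbatim}
# Sketch of the function selection criteria.
from typing import Optional

def select(fn: dict) -> Optional[dict]:
    # Drop the 'self' parameter (and its corresponding type) if present.
    if "self" in fn["n_p"]:
        i = fn["n_p"].index("self")
        fn["n_p"] = fn["n_p"][:i] + fn["n_p"][i + 1:]
        fn["t_p"] = fn["t_p"][:i] + fn["t_p"][i + 1:]
    # Keep the function only if it carries at least one type annotation ...
    has_types = any(t is not None for t in fn["t_p"]) or fn["t_r"] is not None
    # ... and at least one return expression.
    if not (has_types and len(fn["e_r"]) > 0):
        return None
    # Types we never predict: 'Any' for parameters, 'Any' and 'None' for returns.
    fn["t_p"] = [None if t == "Any" else t for t in fn["t_p"]]
    if fn["t_r"] in ("Any", "None"):
        fn["t_r"] = None
    return fn
\end{verbatim}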

\subsection{Vector Representation} \label{method:vector}
From the selected function tuples, we create a parameter datapoint for each parameter and a return datapoint. We convert each of these datapoints to a vector. We explain the structure of these vectors in \ref{method:vector:structure}. All textual elements are converted using word embeddings (see \ref{method:vector:embeddings}), and types are one-hot encoded (see \ref{method:vector:types}).

\subsubsection{Datapoints and Vector Structure} \label{method:vector:structure}

\input{tables/vectors.tex}

The format of the input vectors is shown in Table \ref{table:vector-param} for parameter datapoints, and in Table \ref{table:vector-return} for return datapoints. All elements of the features have size 100. This results in a 55 $\times$ 100 input vector.

The lengths of the features are based on an analysis of the features in our dataset. The results are shown in Table \ref{table:feature-lengths}. A full analysis is available in our GitHub repository (see section \ref{evaluation:implementation}).

\input{tables/feature_lengths.tex}

The datapoint type indicates whether the vector represents a parameter or a return value. A separator is a 1-vector of size 100. For parameter datapoints, padding (0-vectors) is used to ensure that the vectors for both kinds of datapoints have the same dimensions.
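
A minimal sketch of how such a fixed-size input matrix could be assembled from the embedded features, the separators, and the padding is given below; it assumes NumPy, and the feature order and maximum lengths are placeholders for the values in Tables \ref{table:vector-param}--\ref{table:feature-lengths}.

\begin{verbatim}
# Sketch: assembling one 55 x 100 input matrix from embedded features,
# separators (1-vectors) and padding (0-vectors). The feature order and
# maximum lengths are placeholders for the values in the tables above.
import numpy as np

EMB_DIM = 100     # size of every feature element
TOTAL_ROWS = 55   # fixed number of rows per datapoint

def build_input(features, max_lengths):
    """features: list of (n_i x EMB_DIM) arrays; max_lengths: rows per feature."""
    rows = []
    for feat, max_len in zip(features, max_lengths):
        feat = feat[:max_len]                               # truncate long features
        pad = np.zeros((max_len - feat.shape[0], EMB_DIM))  # pad short ones
        rows.extend([np.vstack([feat, pad]),
                     np.ones((1, EMB_DIM))])                # separator after each feature
    matrix = np.vstack(rows)
    # Trailing padding so parameter and return datapoints share the same shape.
    return np.vstack([matrix, np.zeros((TOTAL_ROWS - matrix.shape[0], EMB_DIM))])
\end{verbatim}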

\subsubsection{Learning Embeddings} \label{method:vector:embeddings}
It is important that semantically similar words result in vectors that are close to each other in the n-dimensional vector space; hence, we cannot assign random vectors to words. Instead, we train an embeddings model based on Word2Vec \cite{Mikolov2013EfficientSpace}. Since the meaning of certain words within the context of a (specific) programming language differs from their meaning in general English, we cannot use pre-trained embeddings.

We train embeddings separately for comments and identifiers. Comments are often long (sequences of) sentences, while identifiers can be seen as short sentences. Similarly to \cite{Malik2019NL2Type:Information}, we train two embeddings, because the identifiers ``tend to contain more source code-specific jargon and abbreviations than comments''.

\begin{notsw}
Using the trained model, we convert all textual elements in the datapoints to sequences of vectors.

For the training itself, all words that occur 5 times or fewer are not considered, to prevent overfitting. Since Word2Vec learns the context of a word by considering a certain number of neighbouring words in a sequence, this number is set to 5.
The dimension of the word embedding itself is found by counting all the unique words in the comments and identifiers and taking the 4th root of the result, as suggested in \cite{TensorFlowTeam2017IntroducingColumnss}. This results in a recommended dimension of 14.
\end{notsw}
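
As an illustration, training these two embedding models could look as follows with gensim's Word2Vec implementation; the use of gensim itself is our assumption, while the hyper-parameter values follow the text above.

\begin{verbatim}
# Sketch: training the two embedding models with gensim's Word2Vec (gensim >= 4).
from gensim.models import Word2Vec

def train_embeddings(comment_sentences, identifier_sentences, dim=14):
    # min_count=6 drops words occurring 5 times or fewer; window=5 is the
    # number of neighbouring words considered as context.
    comments_model = Word2Vec(sentences=comment_sentences,
                              vector_size=dim, window=5, min_count=6)
    identifiers_model = Word2Vec(sentences=identifier_sentences,
                                 vector_size=dim, window=5, min_count=6)
    return comments_model, identifiers_model

# The sentences are lists of token lists, e.g. the output of the preprocessing
# step; a word is then embedded as comments_model.wv["remove"].
\end{verbatim}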

\subsubsection{Representing Types}\label{method:vector:types}
The parameter types and return type are not embedded; instead, we encode these elements as vectors using a one-hot encoding \cite{Neter1996AppliedModels} that produces vectors of length $|T_{frequent}|$, where $T_{frequent}$ is the set of types that occur most frequently in the dataset. We also add the type ``other'' to $T_{frequent}$ to represent all types not present in the set of most frequently occurring types. We only select the most frequent types because there is not enough training data for less frequent types, which would result in a less effective learning process. The resulting vector for a type has all zeros except at the location corresponding to the type; for example, the type \texttt{str} may be encoded as $[0, 1, 0, 0, ..., 0]$.

We limit the set $T_{frequent}$ to the 1000 most frequent types, as this has been shown to be an effective number in earlier work \cite{Malik2019NL2Type:Information}. We show the 10 most frequent types in Table \ref{table:most-frequent-types}.
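
A small sketch of this encoding is shown below; the toy data and the cut-off of 3 types are for illustration only.

\begin{verbatim}
# Sketch of the one-hot type encoding over the most frequent types plus "other".
from collections import Counter
import numpy as np

def build_type_index(all_types, k=1000):
    frequent = [t for t, _ in Counter(all_types).most_common(k)]
    return {t: i for i, t in enumerate(frequent + ["other"])}

def one_hot(type_name, type_index):
    vec = np.zeros(len(type_index))
    vec[type_index.get(type_name, type_index["other"])] = 1.0
    return vec

# Toy example with the k=3 most frequent types:
index = build_type_index(["str", "int", "str", "bool", "int", "str"], k=3)
print(one_hot("str", index))           # [1. 0. 0. 0.]
print(one_hot("MyCustomType", index))  # maps to "other": [0. 0. 0. 1.]
\end{verbatim}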

\input{tables/most_frequent_types.tex}

\subsection{Training the RNN} \label{method:lstm}
Given the vector representations described in section \ref{method:vector}, we want to learn a function that maps an input of $x$ vectors of dimensionality $k$ to one of the 1000 types in $T$; that is, the mapping $\mathbb{R}^{x \times k} \rightarrow \mathbb{R}^{|T|}$. To learn this mapping, we train a recurrent neural network (RNN). An RNN has feedback connections, giving it memory of previous input and therefore the ability to process (ordered) sequences of text. This makes it a good choice when working with natural language information.

We implement the RNN using LSTM units \cite{Gers1999LearningLSTM}. LSTM units have been successfully applied in NL2Type \cite{Malik2019NL2Type:Information}, where the choice for LSTMs was made based on their use for classification tasks similar to our problem. We describe the full details of the model in \ref{evaluation:experiments:models}.
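
As a rough illustration only (the layer sizes are placeholders and not the configuration described in \ref{evaluation:experiments:models}), an LSTM-based classifier over the $55 \times 100$ inputs could be sketched in Keras as follows.

\begin{verbatim}
# Rough Keras sketch of an LSTM classifier over 55 x 100 inputs; the layer
# sizes are placeholders, not the configuration evaluated in the paper.
from tensorflow.keras import layers, models

NUM_TYPES = 1000  # |T_frequent|

model = models.Sequential([
    layers.Input(shape=(55, 100)),                   # one row per embedded word
    layers.LSTM(256),                                # recurrent layer with memory
    layers.Dense(NUM_TYPES, activation="softmax"),   # one probability per type
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",       # targets are one-hot type vectors
              metrics=["accuracy"])
# model.fit(train_inputs, train_one_hot_types, epochs=..., batch_size=...)
\end{verbatim}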

\subsection{Prediction using the trained RNN} \label{method:prediction}
After training is done, the model can be used to predict the types for new, unseen functions. The input to the model is similar to the input during the training phase. This means that first a function needs to be collected from an AST (section \ref{method:extract}), then the function elements need to be preprocessed (section \ref{method:preprocess}), and finally, the function must be represented as multiple vectors, one for each parameter type and one for the return type, as described in section \ref{method:vector}.

The model can now be queried with these input vectors to predict the corresponding types. For each input vector, the network outputs a set of likely types, together with the probability that each of these types is correct.
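
For illustration, querying a trained (Keras-style) model for the top-$k$ most likely types could look as follows; the example output is hypothetical.

\begin{verbatim}
# Sketch: querying the trained model for the k most likely types of one datapoint.
import numpy as np

def predict_types(model, x, index_to_type, k=3):
    probs = model.predict(x[np.newaxis, :, :])[0]  # add a batch dimension of 1
    top = np.argsort(probs)[::-1][:k]              # indices of the k highest probabilities
    return [(index_to_type[i], float(probs[i])) for i in top]

# Hypothetical output for a parameter datapoint:
# [('str', 0.81), ('bytes', 0.09), ('other', 0.03)]
\end{verbatim}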