
\chapter{Conclusions}
\label{ch:conclusions}

The report is concluded with a short summary and a discussion of a few possible directions for future work.

\section{Summary}

Overall, the project was successful. The new theory, introduced in the form of the optimisation problem formulation and the framework, represents a step towards powerful, user-friendly tools for automated anomaly detection research. Additionally, ADRT has shown that the framework can be implemented and used to effectively find anomaly detection methods for real applications. Furthermore, the evaluation utilities and evaluation scripts in ADRT have shown that producing open, reproducible anomaly detection research need not be difficult. Finally, as summarized in the next section, the project illuminated several new frontiers for future work.

However, there were some shortcomings. Initially, the project focus was on the implementation and evaluation of a few specific methods for specific datasets. This focus gradually shifted towards a more theoretical one throughout the course of the project, which meant that a large body of work was produced that was ultimately discarded.

Moreover, a proper evaluation of ADRT could not be performed due to a lack of data and time. As a result, exactly how practical and useful the framework really is for actual research is still an open question. However, this has been partially mitigated by the provision of reusable evaluation scripts and utilities, which can be used to properly answer this question when the right data becomes available.

\section{Future work}
Several interesting directions for future work have been opened up through this project, a few of which are now discussed.

\subsection{Optimisation}
Developing efficient optimisation methods is key to the successful implementation of the optimisation problem and framework. Currently, only naive optimisation methods have been implemented in ADRT; faster and better methods must be implemented for it to be practical for real-world applications.

There are several interesting paths to take towards this goal. As indicated in the previous chapter, at least for distance-based anomaly measures, errors seem to vary rather smoothly over problem sets, with few non-global error minima or other irregularities. Thus, it can be expected that improved optimisation heuristics, such as simulated annealing, could have a large impact.
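
To make this suggestion concrete, recall the standard acceptance rule used by simulated annealing. The notation here is introduced purely for illustration and is not taken from the earlier chapters: $p$ denotes the current problem, $p'$ a randomly chosen neighbouring problem in the problem set, $\epsilon(\cdot)$ the error being minimised, and $T > 0$ a temperature parameter that is gradually lowered during the search. A move from $p$ to $p'$ is accepted with probability
\[
  P(p \to p') = \min\left\{ 1,\; \exp\left( -\frac{\epsilon(p') - \epsilon(p)}{T} \right) \right\}.
\]
Improvements are thus always accepted, while deteriorations are accepted with a probability that shrinks as $T$ decreases. If errors do vary as smoothly as the results suggest, this occasional acceptance of worse problems should suffice to escape the few non-global minima that occur.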

Another possibility would be to perform the optimisation over a larger set of problems, for instance by allowing combinations of elements of the problem set. This would be a fairly trivial extension, and would allow the optimisation to produce ensemble-style combinations of methods.

Finally, it seems likely that the optimisation itself could be learned to some degree. For instance, if correlations between the accuracy of problems and the underlying characteristics of the data being analyzed could be found, then these could likely be exploited to provide very efficient optimisation methods. Another, related possibility would be to allow for user-assisted optimisation.

The optimisation process could also be made more efficient through performance improvements. Performance has been deliberately deemphasized in favor of simplicity in the development of ADRT, and it can be expected that dramatic performance improvements could be reaped fairly easily by, for instance, enabling parallelism or rewriting ADRT in a programming language better suited for numeric computations. A performance optimisation that can be expected to have drastic results is the multi-resolution heuristic suggested in Section~\ref{sect:s}.

\subsection{New applications}
There is still much work to be done on applying the framework to sequences. To begin with, tasks involving finding anomalous sequences, as well as tasks involving discrete and categorical data, have yet to be studied and implemented in ADRT. There are also several interesting transformations and anomaly measures which are encountered in the literature, but which have yet to be implemented in ADRT. Finally, how ADRT performs on sets of more diverse real-valued sequences needs to be researched.

Of course, applying the framework and ADRT to entirely different application domains, such as fraud detection or bioinformatics, would also be interesting. Essentially, this should be as simple as defining the data and problem sets, as well as (optionally) implementing a few more component choices.

\subsection{Deployment}
A major goal in the design of ADRT was for it to be as modular and flexible as possible, so that creating a user interface for it or hooking it up for use with data analysis software such as Splunk would be easy. This was achieved in two ways: first, by designing it as a collection of command-line tools, and second, by making it scriptable. Extending ADRT to work with existing data analysis software or a custom user interface could help further elucidate just how sound and practical the framework is for everyday analysis use, as well as how user-friendly it can be made.

One of the main benefits of the optimisation problem formulation and framework is that they provide a means of replacing the work-intensive aspects of anomaly detection research with an automated optimisation process. As was mentioned in Chapter~\ref{ch:background}, anomaly detection research typically necessitates the involvement of both a domain expert and an expert on anomaly detection, and is a laborious process involving plenty of trial and error. In contrast, ADRT lets users focus almost entirely on more significant aspects of the process: finding methods suitable for a given domain becomes a matter of writing a script which defines the data, places a few restrictions on the problem set, and runs an optimisation. The declarative nature of this work could be taken one step further by defining a domain-specific language for describing the data and the problem set. Reimplementing ADRT in a language with an expressive type system, such as Haskell, would be a good first step in this direction.