# OutlierDetection.jl

*OutlierDetection.jl* is a Julia toolkit for detecting outlying objects, also known as *anomalies*. This package is an effort to make Julia a first-class citizen in the Outlier- and Anomaly-Detection community. *Why should you use this package?*

- Provides a unified API for outlier detection in Julia
- Provides access to state-of-the-art outlier detection algorithms
- Seamlessly integrates with Julia's existing machine learning ecosystem

**Citing**

If you use *OutlierDetection.jl* in a scientific publication, we appreciate citations to:

```
@article{muhr2022outlierdetection,
title={OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language},
author={Muhr, David and Affenzeller, Michael and Blaom, Anthony D},
journal={arXiv preprint arXiv:2211.04550},
year={2022}
}
```

or

```
Muhr, David, Michael Affenzeller, and Anthony D. Blaom. "OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language." arXiv preprint arXiv:2211.04550 (2022).
```

## Installation

It is recommended to use Pkg.jl for installation. Run the commands below to install the latest official release, or use `] add OutlierDetection` in the Julia REPL.

```
import Pkg
Pkg.add("OutlierDetection")
```

If you would like to modify the package locally, you can use `Pkg.develop("OutlierDetection")` or `] dev OutlierDetection` in the Julia REPL. This fetches a full clone of the package to `~/.julia/dev/` (the path can be changed by setting the environment variable `JULIA_PKG_DEVDIR`).
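To see where a development clone would end up, the lookup described above can be sketched in plain Julia. This is a minimal sketch, not Pkg's own code: `resolve_devdir` is a hypothetical helper, and it assumes the first entry of `DEPOT_PATH` is your user depot (usually `~/.julia`).

```julia
# Hypothetical helper: mimic the documented rule that JULIA_PKG_DEVDIR,
# when set, overrides the default `dev` folder inside the user depot.
function resolve_devdir(env = ENV)
    # Assumption: the first depot is the user depot (usually ~/.julia)
    default = joinpath(first(DEPOT_PATH), "dev")
    return get(env, "JULIA_PKG_DEVDIR", default)
end

println("Development clones go to: ", resolve_devdir())
```

Passing a dictionary instead of `ENV` lets you check the override without touching your real environment.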

## Usage

*OutlierDetection.jl* is built on top of MLJ and provides many `Detector` implementations for MLJ. A `Detector` simply assigns a real-valued score to each sample, where a higher score indicates a more outlying sample. The detectors live in sub-packages of OutlierDetectionJL, e.g. OutlierDetectionNeighbors, and can be loaded directly with MLJ, as shown below.

```
using MLJ
using OutlierDetection
using OutlierDetectionData: ODDS
# download and open the thyroid benchmark dataset
X, y = ODDS.load("thyroid")
# split the row indices in half; the `test` rows are scored below
train, test = partition(eachindex(y), 0.5, shuffle=true)
# load the detector
KNN = @iload KNNDetector pkg=OutlierDetectionNeighbors
# instantiate a detector with default parameters, returning scores
knn = KNN()
# bind the detector to data and learn a model with all data
knn_raw = machine(knn, X) |> fit!
# transform data to raw outlier scores based on the test data; note that there
# is no `predict` defined for raw detectors
transform(knn_raw, rows=test)
# OutlierDetection.jl provides helper functions to normalize the scores,
# for example using min-max scaling based on the training scores
knn_probas = machine(ProbabilisticDetector(knn), X) |> fit!
# predict outlier probabilities based on the test data
predict(knn_probas, rows=test)
# OutlierDetection.jl also provides helper functions to turn scores into classes,
# for example by imposing a threshold based on the training data percentiles
knn_classifier = machine(DeterministicDetector(knn), X) |> fit!
# predict outlier classes based on the test data
predict(knn_classifier, rows=test)
```
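To build intuition for what a raw score means, here is a brute-force sketch of a k-nearest-neighbor distance score in plain Julia. It is not the implementation used by OutlierDetectionNeighbors; `knn_scores` is a hypothetical helper, and it assumes Euclidean distance with one observation per column.

```julia
using LinearAlgebra: norm

# Hypothetical helper (not part of OutlierDetection.jl): score each column of
# `Xtest` by its Euclidean distance to the k-th nearest column of `Xtrain`.
function knn_scores(Xtrain::AbstractMatrix, Xtest::AbstractMatrix; k::Int = 5)
    map(eachcol(Xtest)) do x
        dists = [norm(x - t) for t in eachcol(Xtrain)]
        partialsort(dists, k)  # k-th smallest distance
    end
end

# four clustered training points (one per column) and two query points
Xtrain = [0.0 1.0 0.0 1.0;
          0.0 0.0 1.0 1.0]
Xtest = [0.5 10.0;
         0.5 10.0]

scores = knn_scores(Xtrain, Xtest; k = 2)
```

The second query point lies far from the training cluster, so it receives the larger score, which matches the convention that scores increase with outlierness.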

It is also possible to use *OutlierDetection.jl* without MLJ; note, however, that more explicit steps are necessary.

```
using OutlierDetection: fit, transform, scale_minmax, classify_quantile, outlier_fraction
using OutlierDetectionNeighbors: KNNDetector # explicitly import detector
using OutlierDetectionData: ODDS
X, y = ODDS.load("thyroid")
knn = KNNDetector()
# explicit conversion to a native array is necessary
# note that we are using the transposed data, because column-major data is expected
Xmatrix = Matrix(X)'
# explicit fit result and training scores
model, scores_train = fit(knn, Xmatrix[:, 11:end]; verbosity = 0)
# transform the first 10 points to scores (not used for training)
scores_test = transform(knn, model, Xmatrix[:, 1:10])
# explicitly normalize train and test scores
proba_train, proba_test = scale_minmax((scores_train, scores_test))
# explicitly convert scores to labels (> 95th percentile would be an outlier)
labels_train, labels_test = classify_quantile(0.95)((scores_train, scores_test))
```
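The two post-processing steps above (normalizing scores and thresholding them at a training quantile) can be sketched in plain Julia. The functions `minmax_sketch` and `quantile_sketch` are hypothetical stand-ins, not the package's `scale_minmax` and `classify_quantile`; clamping out-of-range test scores and the label strings are choices made for this illustration, and the training scores are assumed not to be all equal.

```julia
using Statistics: quantile

# Hypothetical helper: rescale scores to [0, 1] using the minimum and maximum
# of the training scores; test scores outside that range are clamped.
function minmax_sketch(scores_train, scores_test)
    lo, hi = extrema(scores_train)
    rescale(s) = clamp.((s .- lo) ./ (hi - lo), 0.0, 1.0)
    return rescale(scores_train), rescale(scores_test)
end

# Hypothetical helper: label scores above the q-th quantile of the
# training scores as outliers.
function quantile_sketch(q, scores_train, scores_test)
    threshold = quantile(scores_train, q)
    label(s) = ifelse.(s .> threshold, "outlier", "normal")
    return label(scores_train), label(scores_test)
end

scores_train = [1.0, 2.0, 3.0, 4.0, 5.0]
scores_test = [0.0, 3.0, 9.0]

proba_train, proba_test = minmax_sketch(scores_train, scores_test)
labels_train, labels_test = quantile_sketch(0.8, scores_train, scores_test)
```

Both helpers derive their parameters (minimum, maximum, threshold) from the training scores only, so test scores are judged on the same scale as the training data.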

## Algorithms (also known as Detectors)

Algorithms marked with '✓' are implemented in Julia. Algorithms marked with '✓ (py)' are implemented in Python (thanks to the wonderful PyOD library) with an existing Julia interface through PyCall. If you would like to know more, open the detector reference.

Name | Description | Year | Status | Authors |
---|---|---|---|---|
LMDD | Linear deviation-based outlier detection | 1996 | ✓ (py) | Arning et al. |
KNN | Distance-based outliers | 1997 | ✓ | Knorr and Ng |
MCD | Minimum covariance determinant | 1999 | ✓ (py) | Rousseeuw and Driessen |
KNN | Distance to the k-th nearest neighbor | 2000 | ✓ | Ramaswamy |
LOF | Local outlier factor | 2000 | ✓ | Breunig et al. |
OCSVM | One-class support vector machine | 2001 | ✓ (py) | Schölkopf et al. |
KNN | Sum of distances to the k-nearest neighbors | 2002 | ✓ | Angiulli and Pizzuti |
COF | Connectivity-based outlier factor | 2002 | ✓ | Tang et al. |
LOCI | Local correlation integral | 2003 | ✓ (py) | Papadimitriou et al. |
CBLOF | Cluster-based local outliers | 2003 | ✓ (py) | He et al. |
PCA | Principal component analysis | 2003 | ✓ (py) | Shyu et al. |
IForest | Isolation forest | 2008 | ✓ (py) | Liu et al. |
ABOD | Angle-based outlier detection | 2009 | ✓ | Kriegel et al. |
SOD | Subspace outlier detection | 2009 | ✓ (py) | Kriegel et al. |
HBOS | Histogram-based outlier score | 2012 | ✓ (py) | Goldstein and Dengel |
SOS | Stochastic outlier selection | 2012 | ✓ (py) | Janssens et al. |
AE | Auto-encoder reconstruction loss outliers | 2015 | ✓ | Aggarwal |
ABOD | Stable angle-based outlier detection | 2015 | ✓ | Li et al. |
LODA | Lightweight on-line detector of anomalies | 2016 | ✓ (py) | Pevný |
DeepSAD | Deep semi-supervised anomaly detection | 2019 | ✓ | Ruff et al. |
COPOD | Copula-based outlier detection | 2020 | ✓ (py) | Li et al. |
ROD | Rotation-based outlier detection | 2020 | ✓ (py) | Almardeny et al. |
ESAD | End-to-end semi-supervised anomaly detection | 2020 | ✓ | Huang et al. |

If there are already so many algorithms available in Python - *why Julia, you might ask?* Let's have some fun!

```
using OutlierDetection, MLJ
using BenchmarkTools: @benchmark
X = rand(10, 100000)
LOF = @iload LOFDetector pkg=OutlierDetectionNeighbors
PyLOF = @iload LOFDetector pkg=OutlierDetectionPython
lof = machine(LOF(k=5, algorithm=:kdtree, leafsize=30, parallel=true), X) |> fit!
pylof = machine(PyLOF(n_neighbors=5, algorithm="kd_tree", leaf_size=30, n_jobs=-1), X) |> fit!
```

Julia enables you to implement your favorite algorithm in no time, and it will be fast, *blazingly fast*.

```
@benchmark transform(lof, X)
> median time: 341.464 ms (0.00% GC)
```

Interoperating with Python is easy!

```
@benchmark transform(pylof, X)
> median time: 7.934 s (0.00% GC)
```

## Contributing

OutlierDetection.jl is a community effort and your help is extremely welcome! See our contribution guide for more information on how to contribute to the project.

## Contributors ✨

Thanks go to these wonderful people (emoji key):

- David Muhr
- Páll Haraldsson
- Anthony Blaom, PhD
- Pietro Monticone
- Petr Mukhachev
- Tyler Thomas

This project follows the all-contributors specification. Contributions of any kind welcome!