An interface package that defines the methods and types for working with features.
Author invenia
0 Stars
Updated Last
1 Year Ago
Started In
April 2021


Stable Dev Build Status Coverage Code Style: Blue ColPrac: Contributor's Guide on Collaborative Practices for Community Packages

FeatureDescriptors.jl is an interface package for describing features used in models, feature engineering, and other data-science workflows.

Getting Started

Descriptors provide a way to define at a high level the properties of - and relationships between - a collection of features and the corresponding data. By associating a type with a given dataset, it allows users to write methods that can dispatch on the given feature and compose pipelines that are agnostic to the characteristics of any particular feature. Subtypes of Descriptors inherit the properties of the supertype, but these can be overloaded as required.

For example, say that some weather data is contained in a weather.csv file where each column describes a different feature we are interested in using, such as :temperature and :humidity. We can define a general Weather Descriptor with subtypes Temperature and Humidity that are loaded from the same table but use the appropriate columns:

using FeatureDescriptors

abstract type Weather <: Descriptor end

FeatureDescriptors.sources(::Type{<:Weather}) = ["weather.csv"]  # only one table is needed
FeatureDescriptors.categorical_keys(::Type{<:Weather}) = []  # no categories necessary

abstract type Temperature <: Weather end
FeatureDescriptors.quantity_key(::Type{Temperature}) = :temperature

abstract type Humidity <: Weather end
FeatureDescriptors.quantity_key(::Type{Humidity}) = :humidity

A more specific instance of a feature can also be defined, such as a MeanTemperature, perhaps if that feature requires some feature engineering before it can be used.

abstract type MeanTemperature <: Temperature end

using FeatureTransforms

# A trivial feature engineering step in preparing MeanTemperature
function FeatureTransforms.transform(D::Type{<:MeanTemperature}, df) 
    return combine(groupby(df, :time), quantity_key(D) => mean => quantity_key(D))

Finally, another useful feature might be stored in a different table entirely, but we may still want to encode its relationships to the others. For example, if we had rainfall data that was saved in another file that we also wanted to use.

abstract type Precipitation <: Weather end
FeatureDescriptors.sources(::Type{<:Precipitation}) = ["rainfall.csv"]
FeatureDescriptors.quantity_key(::Type{<:Precipitation}) = :rainfall 

The Descriptors API

All Descriptors are required to implement the following:

  1. A sources method that specifies where to retrieve the data. A Descriptor may be associated with multiple sources, because it may be necessary to perform feature engineering to create the derived feature.
  2. A quantity_key method that denotes the name of the quantitative variable for the feature, such as :temperature or :price. For the sake of transparency and simplicity, only one quantity_key may be associated with a given Descriptor.
  3. A categorical_keys method that denotes the names of the categorical variables for the feature, such as :colour or :type. If no categorical variables are needed, this returns an empty vector.

You can ensure your Descriptor is implemented correctly by calling the TestUtils.test_interface function.