Jchemo.jl

Dimension reduction, Regression and Discrimination for Chemometrics

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Jchemo.jl is a package for data exploration and prediction with focus on high dimensional data.

The package was initially designed around partial least squares regression and discrimination models and their variants, in particular locally weighted PLS models (LWPLS) (e.g. https://doi.org/10.1002/cem.3209). It has since been expanded to many other methods for analyzing high-dimensional data.

The package was named Jchemo because it is oriented toward chemometrics, but most of the provided methods are fully generic and applicable to other domains.

Functions such as transform, predict, coef and summary are available. Tuning predictive models is facilitated by the generic functions gridscore (validation dataset) and gridcv (cross-validation). Faster versions of these functions are also available for models based on latent variables (LVs) (gridscorelv and gridcvlv) and on ridge regularization (gridscorelb and gridcvlb).

Most functions in the package have a help page providing an example, e.g.:

?plskern

Examples demonstrating Jchemo.jl are available in the project JchemoDemo, intended for training purposes only. The datasets used in the examples are stored in the package JchemoData.jl.

Some functions of the package (in particular those using kNN selections) use multi-threading to speed up computations. Taking advantage of this requires specifying a relevant number of threads (e.g. from the 'Settings' menu of the VS Code Julia extension, in the file 'settings.json').
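As a quick check, the number of threads available to the current Julia session can be inspected from the REPL (a minimal sketch; note that the thread count must be set before Julia starts, e.g. with the `--threads` flag or the JULIA_NUM_THREADS environment variable):

```julia
## Number of threads available to multi-threaded functions.
## Set it before starting Julia, e.g.:
##   julia --threads 8
## or via the JULIA_NUM_THREADS environment variable.
using Base.Threads
nthreads()
```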

Jchemo.jl uses Makie.jl for plotting. Installing and loading one of Makie's backends (e.g. CairoMakie.jl) is required to display the plots.
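For instance, assuming CairoMakie.jl has been installed (pkg> add CairoMakie), a plot can be displayed by loading the backend alongside Jchemo. This minimal sketch mirrors the plotgrid call used in the tuning examples below, with a toy error curve standing in for real results:

```julia
using Jchemo, CairoMakie   # CairoMakie is one possible Makie backend

## Toy tuning curve: error vs. number of latent variables
nlv = 0:10
err = rand(length(nlv))
plotgrid(nlv, err, xlabel = "Nb. LVs", ylabel = "RMSEP").f
```

Without a loaded backend, the figure object is built but nothing is rendered.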

Before updating the package, it is recommended to check What changed to avoid problems due to possible breaking changes.
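Before updating, the currently installed version can be checked with the standard Pkg API (this is generic Julia tooling, not Jchemo-specific):

```julia
using Pkg
Pkg.status("Jchemo")   # prints the installed version of Jchemo
## Compare against the changelog, then update with: pkg> up Jchemo
```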

Dependent packages

Installation

To install Jchemo, run in the Pkg REPL:

pkg> add Jchemo

or for a specific version:

pkg> add Jchemo@0.1.18

or for the current development version (not fully stable):

pkg> add https://github.com/mlesnoff/Jchemo.jl.git

Benchmark - Computation time for a PLS with 1e6 observations

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
using Jchemo

## PLS2 with 1e6 observations
## (NB.: multi-threading is not used in plskern) 
n = 10^6  # nb. observations (samples)
p = 500   # nb. X-variables (features)
q = 10    # nb. Y-variables to predict
X = rand(n, p)
Y = rand(n, q)
nlv = 25  # nb. PLS latent variables

@time plskern(X, Y; nlv = nlv) ;
8.100469 seconds (299 allocations: 4.130 GiB, 6.58% gc time)

@time plskern!(X, Y; nlv = nlv) ;
7.232234 seconds (6.47 k allocations: 338.617 MiB, 7.39% gc time, 0.13% compilation time)

Examples of syntax for predictive models

Fitting a model

using Jchemo

n = 150 ; p = 200 ; q = 2 ; m = 50 
Xtrain = rand(n, p) ; Ytrain = rand(n, q) 
Xtest = rand(m, p) ; Ytest = rand(m, q) 

## Model fitting
nlv = 5 
fm = plskern(Xtrain, Ytrain; nlv = nlv) ;
pnames(fm) # print the names of objects contained in 'fm'

## Some summary
summary(fm, Xtrain)

## Computation of the PLS scores (LVs) for Xtest
Jchemo.transform(fm, Xtest)
Jchemo.transform(fm, Xtest; nlv = 1)

## PLS b-coefficients
Jchemo.coef(fm)
Jchemo.coef(fm; nlv = 2)

## Predictions and performance of the fitted model
res = Jchemo.predict(fm, Xtest) 
res.pred
rmsep(res.pred, Ytest)
mse(res.pred, Ytest)

Jchemo.predict(fm, Xtest).pred
Jchemo.predict(fm, Xtest; nlv = 0:3).pred 

Tuning a model by grid-search

  • With gridscore

using Jchemo, StatsBase, CairoMakie

ntrain = 150 ; p = 200
ntest = 80 
Xtrain = rand(ntrain, p) ; ytrain = rand(ntrain) 
Xtest = rand(ntest, p) ; ytest = rand(ntest)
## Train is split into Cal+Val to tune the model,
## and the generalization error is estimated on Test.
nval = 50
s = sample(1:ntrain, nval; replace = false) 
Xcal = rmrow(Xtrain, s)
ycal = rmrow(ytrain, s)
Xval = Xtrain[s, :]
yval = ytrain[s]

## Computation of the performance over the grid
## (the model is fitted on Cal, and the performance is 
## computed on Val)
nlv = 0:10 
res = gridscorelv(
    Xcal, ycal, Xval, yval;
    score = rmsep, fun = plskern, nlv = nlv) 

## Plot the results
plotgrid(res.nlv, res.y1,
    xlabel = "Nb. LVs", ylabel = "RMSEP").f

## Predictions and performance of the best model
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
fm = plskern(Xtrain, ytrain; nlv = res.nlv[u]) ;
res = Jchemo.predict(fm, Xtest) 
rmsep(res.pred, ytest)

## *Note*: For PLSR models, using gridscorelv is much faster
## than using the generic function gridscore.
## In the same manner, for ridge regression models,
## gridscorelb is much faster than gridscore.

## Syntax for the generic gridscore
pars = mpar(nlv = nlv)
res = gridscore(
    Xcal, ycal, Xval, yval;
    score = rmsep, fun = plskern, pars = pars) 
  • With gridcv

using Jchemo, StatsBase, CairoMakie

ntrain = 150 ; p = 200
ntest = 80 
Xtrain = rand(ntrain, p) ; ytrain = rand(ntrain) 
Xtest = rand(ntest, p) ; ytest = rand(ntest)
## Train is used to tune the model,
## and the generalization error is estimated on Test.

## Build the cross-validation (CV) segments
## Replicated K-Fold CV
K = 5      # Nb. folds
rep = 10   # Nb. replications (rep = 1 ==> no replication)
segm = segmkf(ntrain, K; rep = rep)

## Or replicated test-set CV
m = 30     # Size of the test-set
rep = 10   # Nb. replications (rep = 1 ==> no replication)
segm = segmts(ntrain, m; rep = rep) 

## Computation of the performances over the grid
nlv = 0:10 
rescv = gridcvlv(
    Xtrain, ytrain; segm = segm,
    score = rmsep, fun = plskern, nlv = nlv) ;
pnames(rescv)
res = rescv.res

## Plot the results
plotgrid(res.nlv, res.y1,
    xlabel = "Nb. LVs", ylabel = "RMSEP").f

## Predictions and performance of the best model 
u = findall(res.y1 .== minimum(res.y1))[1] 
res[u, :]
fm = plskern(Xtrain, ytrain; nlv = res.nlv[u]) ;
res = Jchemo.predict(fm, Xtest) 
rmsep(res.pred, ytest)

## *Note*: For PLSR models, using gridcvlv is much faster
## than using the generic function gridcv.
## In the same manner, for ridge regression models,
## gridcvlb is much faster than gridcv.

## Using the generic function gridcv:
pars = mpar(nlv = nlv)
rescv = gridcv(
    Xtrain, ytrain; segm = segm,
    score = rmsep, fun = plskern, pars = pars) ;
pnames(rescv)
res = rescv.res

Author

Matthieu Lesnoff

matthieu.lesnoff@cirad.fr

How to cite

Lesnoff, M. 2021. Jchemo: a Julia package for dimension reduction, regression and discrimination for chemometrics. https://github.com/mlesnoff/Jchemo. CIRAD, UMR SELMET, Montpellier, France

Acknowledgments