Increasing the ability to investigate plant functions and structure with high-throughput methods has become a major target in plant breeding and precision agriculture. As a low-cost, high-precision and high-throughput technique, near-infrared spectroscopy (NIRS) can predict the composition of products by combining spectral information with laboratory reference data through an appropriate predictive model.
To cope with a large number of explanatory variables, high sensitivity to the physical characteristics of samples (e.g. flour particle size, moisture), and high information redundancy, traditional NIRS chemometrics, i.e. the science that employs statistical and mathematical methods to explain near-infrared spectra, combines one or more spectral pretreatments (e.g. multiplicative scatter correction, detrending) with a calibration model. The number of available spectral pretreatments and model types increases constantly, and the choices made differ between studies and between analytes within a single study (e.g. starch, carotenoid, cyanide). Despite prior knowledge of pretreatments and sample presentation (e.g. fresh, dried, milled), choosing among these methods and their combinations is becoming increasingly tedious.
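As an illustration of the two pretreatments named above, here is a minimal NumPy sketch of multiplicative scatter correction and polynomial detrending. The function names `msc` and `detrend`, the default polynomial degree and the synthetic spectra are assumptions made for the example; they are not taken from the study.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a
    reference (the mean spectrum by default) and remove the fitted
    offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected

def detrend(spectra, degree=2):
    """Subtract a least-squares polynomial baseline from each spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    # Rescaled abscissa keeps the polynomial fit well conditioned.
    x = np.linspace(-1.0, 1.0, spectra.shape[1])
    coeffs = np.polyfit(x, spectra.T, degree)          # shape (degree + 1, n_samples)
    baseline = (np.vander(x, degree + 1) @ coeffs).T   # shape (n_samples, n_wavelengths)
    return spectra - baseline

# Synthetic demo (hypothetical data): one absorption band seen with
# three different gains and offsets, as scattering effects would produce.
wavelengths = np.linspace(0.0, 1.0, 100)
peak = np.exp(-((wavelengths - 0.5) / 0.1) ** 2)
gains = np.array([0.8, 1.0, 1.3])
offsets = np.array([0.05, 0.0, 0.2])
raw = gains[:, None] * peak + offsets[:, None]
corrected = msc(raw)

# Pure polynomial baselines should be removed entirely by detrending.
baselines = np.vstack([0.3 + 0.5 * wavelengths, 0.1 - 0.2 * wavelengths ** 2])
residual = detrend(baselines)
```

In this toy case the three raw spectra differ only by gain and offset, so MSC maps them onto the same curve. Real spectra also carry chemical differences, which the correction is meant to preserve.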
To make the most of recent advances in data science and computational capabilities, this study aims to develop a generic NIRS calibration pipeline that combines the advantages of traditional approaches (e.g. partial least squares regression) with modern deep learning techniques. In particular, spatiotemporal algorithms dedicated to the analysis of 2D and 3D signals (e.g. LSTM, RNN, CNN) will be used to better exploit the spatial information contained in spectra. Moreover, these algorithms can mimic the spectral pretreatment step through their kernels, activation functions and convolutional layers.
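To make concrete how a convolutional layer can reproduce a pretreatment, the sketch below applies a fixed two-point kernel to a synthetic spectrum and recovers the first-difference derivative, a common NIRS pretreatment. The helper names `conv1d` and `relu`, the kernel values and the data are assumptions for illustration. In a trained network the kernel weights would be learned rather than fixed.

```python
import numpy as np

def conv1d(spectrum, kernel):
    """Sliding dot product of a kernel along the spectrum, without padding.
    Deep learning 'convolution' layers compute this cross-correlation."""
    return np.correlate(spectrum, kernel, mode="valid")

def relu(x):
    """Rectified linear activation."""
    return np.maximum(x, 0.0)

# Synthetic spectrum (hypothetical data): one band on a constant offset.
wavelengths = np.linspace(0.0, 1.0, 200)
band = np.exp(-((wavelengths - 0.5) / 0.1) ** 2)
spectrum = band + 0.4

# A kernel of [-1, 1] yields x[i + 1] - x[i], i.e. the first difference.
derivative_kernel = np.array([-1.0, 1.0])
feature_map = conv1d(spectrum, derivative_kernel)

# The derivative is insensitive to the additive offset.
feature_map_no_offset = conv1d(band, derivative_kernel)

# The activation keeps only the rising side of the band.
activated = relu(feature_map)
```

The same mechanism extends to wider kernels: smoothing or Savitzky-Golay-type derivative filters are also fixed-weight sliding dot products, so a convolutional layer has the capacity to represent them.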
Meanwhile, growing computational capabilities and improvements in heuristic search algorithms make it possible to explore a broader search space of combinations of spectral pretreatments, model types and hyperparameters. To take advantage of this diversity, we will use model ensembling techniques (i.e. model stacking) to gather the complementary information brought by multiple base models into an improved meta-model. To better select diverse and well-performing base models, the use of generative adversarial networks, game theory and distributed artificial intelligence (multi-agent systems) will be investigated.
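The stacking idea can be sketched as follows: each base model (here a pretreatment paired with a ridge regression, chosen only for brevity) produces out-of-fold predictions, and a linear meta-model is then fitted on those predictions. All names (`ridge_fit`, `stack`, `base_specs`), the choice of ridge regression, the fold count and the synthetic data are assumptions for this example, not the pipeline of the study.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def ridge_predict(model, X):
    coef, intercept = model
    return X @ coef + intercept

def identity(X):
    return X

def first_derivative(X):
    return np.diff(X, axis=1)

def stack(X, y, base_specs, n_folds=5, seed=0):
    """Return out-of-fold base predictions and linear meta-model weights."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.zeros((len(y), len(base_specs)))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        for j, (pretreat, alpha) in enumerate(base_specs):
            # Row-wise pretreatments use no information from other samples.
            Z = pretreat(X)
            model = ridge_fit(Z[train_mask], y[train_mask], alpha)
            oof[test_idx, j] = ridge_predict(model, Z[test_idx])
    # Meta-model: least squares on the out-of-fold predictions plus an intercept.
    A = np.column_stack([oof, np.ones(len(y))])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    return oof, weights

# Synthetic demo (hypothetical data): band height tracks the analyte content.
rng = np.random.default_rng(0)
n, p = 60, 50
wl = np.linspace(0.0, 1.0, p)
y = rng.uniform(0.0, 1.0, n)
X = (y[:, None] * np.exp(-((wl - 0.5) / 0.1) ** 2)
     + rng.uniform(0.0, 0.5, (n, 1))          # random baseline offset
     + rng.normal(0.0, 0.01, (n, p)))         # measurement noise

base_specs = [(identity, 1e-3), (identity, 1.0), (first_derivative, 1e-3)]
oof, weights = stack(X, y, base_specs)
meta_pred = np.column_stack([oof, np.ones(n)]) @ weights
base_rmse = np.sqrt(((oof - y[:, None]) ** 2).mean(axis=0))
meta_rmse = np.sqrt(((meta_pred - y) ** 2).mean())
```

Because the meta-model is fitted on the same out-of-fold predictions it is scored on here, `meta_rmse` is an optimistic estimate. An honest comparison would need an additional held-out set or nested cross-validation.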
The challenges will be to develop a methodological and functional pipeline (i) with enough modularity to allow the integration of innovative methods (e.g. attention-based neural networks), (ii) with optimized metrics to manage the selection of base models despite the trade-off between performance and diversity, (iii) with sufficient genericity for different products, analytes and database sizes, and (iv) with sufficient speed.
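One possible way to express the trade-off in point (ii) is a greedy selection that penalises candidates whose errors correlate with those of already selected models. The scoring rule below (relative RMSE plus a weighted mean absolute residual correlation), the function name `select_base_models` and the `diversity_weight` parameter are purely hypothetical. They are not the metric of the study, only a sketch of what such a criterion could look like.

```python
import numpy as np

def select_base_models(preds, y, n_select, diversity_weight=0.5):
    """Greedy selection: start from the most accurate model, then add the
    candidate with the lowest cost = relative RMSE + weight * redundancy,
    where redundancy is the mean absolute residual correlation with the
    models already selected (hypothetical criterion)."""
    residuals = preds - y[:, None]
    rmse = np.sqrt((residuals ** 2).mean(axis=0))
    corr = np.corrcoef(residuals, rowvar=False)
    selected = [int(np.argmin(rmse))]
    while len(selected) < n_select:
        candidates = [j for j in range(preds.shape[1]) if j not in selected]

        def cost(j):
            redundancy = np.mean([abs(corr[j, k]) for k in selected])
            return rmse[j] / rmse.min() + diversity_weight * redundancy

        selected.append(min(candidates, key=cost))
    return selected

# Deterministic toy predictions (hypothetical data):
# model 1 repeats the errors of model 0; model 2 is less accurate
# but its errors are uncorrelated with those of model 0.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
y = np.linspace(0.0, 1.0, 200)
e0 = 0.10 * np.sin(t)
preds = np.column_stack([y + e0, y + 1.01 * e0, y + 0.12 * np.cos(t)])

accuracy_only = select_base_models(preds, y, n_select=2, diversity_weight=0.0)
with_diversity = select_base_models(preds, y, n_select=2, diversity_weight=0.5)
```

In this toy setup the second pick switches from the redundant model 1 to the complementary model 2 once the diversity penalty is active, which is the behaviour such a metric is meant to encourage.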