v_1.0 change note:
- Significantly improved accuracy of the model (RMSE to 0.34)
- Updated to BigSolDB v2.0 dataset with expanded coverage
- Enhanced feature engineering with symmetric architecture
This app is trained on more than 110,000 molecules with comprehensive solubility data points
measured at different temperatures in different solvents. About 10,000 aqueous solubility data
points were taken from the AqSolDB database (https://doi.org/10.1038/s41597-019-0151-1) and
another 104,000 data points from BigSolDB v2.0 (https://doi.org/10.1038/s41597-025-05559-8),
covering 213 different solvents at temperatures ranging from 243 to 425 K.
The machine learning models behind this app are XGBoost and a 1D convolutional neural network
implemented in PyTorch. After the two models are trained, the final prediction is a weighted
ensemble of their outputs, with the weights optimized on validation data. Molecules are
featurized with RDKit molecular descriptors in a symmetric feature architecture: solute
descriptors + temperature + solvent descriptors, where solute and solvent are each represented
by 209 physicochemical descriptors.
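The symmetric feature layout described above can be sketched as follows. This is a minimal illustration, not the app's actual code: `build_feature_vector`, `describe`, and `dummy_describe` are hypothetical names, and the descriptor function is a placeholder. In practice `describe` would wrap RDKit descriptor calculations (for example the functions listed in `rdkit.Chem.Descriptors.descList`, whose count depends on the RDKit version).

```python
from typing import Callable, List, Sequence

N_DESCRIPTORS = 209  # descriptors per molecule, as stated in the text


def build_feature_vector(
    solute_smiles: str,
    solvent_smiles: str,
    temperature_k: float,
    describe: Callable[[str], Sequence[float]],
) -> List[float]:
    """Concatenate solute descriptors + temperature + solvent descriptors."""
    solute = list(describe(solute_smiles))
    solvent = list(describe(solvent_smiles))
    if len(solute) != N_DESCRIPTORS or len(solvent) != N_DESCRIPTORS:
        raise ValueError(f"describe() must return {N_DESCRIPTORS} values")
    # Same descriptor set on both sides, with temperature in the middle.
    return solute + [float(temperature_k)] + solvent


def dummy_describe(smiles: str) -> List[float]:
    # Hypothetical stand-in for a real descriptor calculation (assumption):
    # returns a constant vector so the example runs without RDKit.
    return [float(len(smiles))] * N_DESCRIPTORS


# Aspirin in ethanol at 298.15 K (illustrative inputs only).
features = build_feature_vector(
    "CC(=O)Oc1ccccc1C(=O)O", "CCO", 298.15, dummy_describe
)
print(len(features))  # 209 + 1 + 209 = 419
```

With this layout, the temperature sits at index 209 and the solute and solvent blocks use the same descriptor order, which is what makes the architecture symmetric.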
The ensemble model achieved an RMSE of 0.34 log units on held-out test data (R² = 0.945),
representing a significant improvement over the previous version. The model performs
consistently across both aqueous and organic solvents.
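The weighted ensemble and the reported metrics can be illustrated with a short sketch. It assumes a single blending weight chosen by grid search to minimize RMSE on validation data; the actual optimization procedure is not specified here, so this is one plausible reading rather than the app's implementation. The function names (`rmse`, `r2`, `blend`, `best_weight`) and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def blend(pred_a: Sequence[float], pred_b: Sequence[float], w: float) -> List[float]:
    """Weighted average of two models' predictions."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


def best_weight(
    y_val: Sequence[float],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    steps: int = 100,
) -> Tuple[float, float]:
    """Grid-search the weight in [0, 1] that minimizes validation RMSE."""
    scored = [
        (rmse(y_val, blend(pred_a, pred_b, i / steps)), i / steps)
        for i in range(steps + 1)
    ]
    best_rmse, w = min(scored)
    return w, best_rmse


# Toy validation data in log units (made up for illustration only).
y_val = [-1.0, -2.0, -3.0, -4.0]
xgb_val = [-1.2, -2.2, -3.2, -4.2]  # stands in for XGBoost predictions
cnn_val = [-0.8, -1.8, -2.8, -3.8]  # stands in for CNN predictions

w, val_rmse = best_weight(y_val, xgb_val, cnn_val)
ensemble = blend(xgb_val, cnn_val, w)
print(w, val_rmse, r2(y_val, ensemble))
```

The weight found on validation data would then be applied unchanged to test-set predictions, so the reported RMSE and R² are computed on data that played no role in choosing it.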
v_0.30 change note:
- improved accuracy of the model (RMSE to 0.42)
This app is trained on more than 10,000 molecules with more than 60,000
solubility data points at different temperatures in different solvents.
About 9,000 aqueous solubility data points were taken from the AqSolDB database
(https://doi.org/10.1038/s41597-019-0151-1)
and another 54,000 data points from BigSolDB (https://doi.org/10.26434/chemrxiv-2023-qqslt).
The machine learning models behind this app are XGBoost and a convolutional neural network
with a customized loss function implemented in PyTorch. After the two models were trained,
the final prediction was taken as a weighted combination of the model outputs.
Featurization of the compositions was based on the RDKit package.
The model achieved an RMSE of about 0.44 on the tested solubility data on a logarithmic scale.