Predicting Abraham model solvent coefficients

Modeling

We calculated CDK descriptors for each solvent using cdkdescui [30] and then built five
random forest models in R, one each for e0, s0, a0, b0, and v0. The resulting models had
out-of-bag (OOB) R² values ranging from a barely significant 0.31 for e0 to a highly
significant 0.92 for a0; see the Open Notebook page for more details [29]. Because of the
limited number of data points, we did not split the data into training and test sets.
Instead, we used the OOB values, which are generated automatically by random forest
models, as our means of validation. A summary of the modeling results can be found in Table 2.
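The idea behind out-of-bag validation can be illustrated with a minimal sketch. It is not the R random forest workflow used for the models above: as a stand-in it bags simple 1-nearest-neighbour regressors on synthetic data, and the function name `oob_r2`, its parameters, and the data are all hypothetical. Each point is scored only by the bootstrap rounds that did not draw it, so no separate test set is needed.

```python
import random

def oob_r2(xs, ys, n_rounds=200, seed=0):
    """Out-of-bag R^2 for bagged 1-nearest-neighbour regressors (toy stand-in)."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n   # running sum of OOB predictions per point
    counts = [0] * n   # number of rounds in which the point was out of bag
    for _ in range(n_rounds):
        # Bootstrap sample: draw n indices with replacement.
        in_bag = {rng.randrange(n) for _ in range(n)}
        for i in range(n):
            if i in in_bag:
                continue  # only points left out of the bag are scored
            nearest = min(in_bag, key=lambda j: abs(xs[j] - xs[i]))
            sums[i] += ys[nearest]
            counts[i] += 1
    scored = [i for i in range(n) if counts[i] > 0]
    preds = {i: sums[i] / counts[i] for i in scored}
    mean_y = sum(ys[i] for i in scored) / len(scored)
    ss_res = sum((ys[i] - preds[i]) ** 2 for i in scored)
    ss_tot = sum((ys[i] - mean_y) ** 2 for i in scored)
    return 1.0 - ss_res / ss_tot

# Synthetic, noise-free data (hypothetical): y = 2x on 30 evenly spaced points.
xs = [i / 29 for i in range(30)]
ys = [2.0 * x for x in xs]
r2 = oob_r2(xs, ys)
print(round(r2, 3))
```

With real descriptor data the base learner would be a decision tree grown on a random subset of descriptors, but the out-of-bag bookkeeping is the same.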

Table 2. Summary of statistical measures of the results of modeling

It is not known why some endpoints are more difficult to model than others. Comparing
the OOB R² values (e0: 0.31, s0: 0.77, a0: 0.92, b0: 0.47, and v0: 0.63) with the standard
deviations of the endpoints, we see no negative correlation between the range of a given
endpoint and the prediction performance of the associated model, as one might suspect.
We conjecture that the refined models will perform better as more measured values become
available. For now, these models should be used only as a starting point for exploring
the wider solvent chemical space.

Errors in the predicted coefficients for new solvents are not equivalent, because when
the coefficients are used to predict partition coefficients they are scaled by their
corresponding Abraham descriptors (see equation 3). Thus, on average, errors in predicting
v and s are more significant than errors in predicting a and b, owing to the difference
in the average values of the solute descriptors. Multiplying the OOB-RMSE for each
coefficient by the corresponding average descriptor value gives scaled RMSE values for
e0, s0, a0, b0, and v0 of 0.16, 0.33, 0.08, 0.23, and 0.30, respectively. Thus the poor
OOB R² values for e0 (0.31) and b0 (0.47) seem less detrimental to the applicability of
the model than a first glance suggests.
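The scaling argument can be written out explicitly. The sketch below assumes the zero-intercept Abraham form suggested by the e0 to v0 notation (equation 3 is not reproduced in this section). The symbols \delta (an error in a coefficient) and \bar{E}, \bar{S}, \bar{A}, \bar{B}, \bar{V} (average solute descriptors) are introduced here for illustration only.

```latex
% Assumed zero-intercept Abraham model for a solute with descriptors E, S, A, B, V
\log P = e_0 E + s_0 S + a_0 A + b_0 B + v_0 V

% First-order effect of coefficient errors on the predicted partition coefficient
\delta(\log P) \approx \delta e_0\, E + \delta s_0\, S + \delta a_0\, A
                     + \delta b_0\, B + \delta v_0\, V

% Scaled RMSE for one coefficient, e.g. e_0, using the average descriptor value
\mathrm{RMSE}^{\mathrm{scaled}}_{e_0} = \mathrm{RMSE}^{\mathrm{OOB}}_{e_0} \cdot \bar{E}
```

Under this reading, a coefficient with a poor R² can still contribute little to the error in log P if the descriptor that multiplies it is small on average.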

To analyze the modeling results further and to investigate model outliers we calculated
an adjusted error D, the distance between the observed values and the predicted values
scaled by the average descriptor values, for each solvent using the following equation:

(4)

where the superscript p indicates the predicted value. These distances were then plotted
as colors on a graph whose x and y axes correspond to the first two principal components
of the measured values for e0, s0, a0, b0, and v0 (see Figure 1). Solvents colored red
have larger calculated distances between their measured and predicted values.
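One form consistent with the verbal description of D is sketched below. It assumes a Euclidean distance in which each coefficient error is weighted by the corresponding average solute descriptor. Both the Euclidean form and the bar notation for the averages are assumptions made here, not taken from equation 4.

```latex
% Assumed form: descriptor-weighted Euclidean distance between measured and
% predicted (superscript p) solvent coefficients
D = \sqrt{ \bar{E}^2 (e_0 - e_0^{p})^2 + \bar{S}^2 (s_0 - s_0^{p})^2
         + \bar{A}^2 (a_0 - a_0^{p})^2 + \bar{B}^2 (b_0 - b_0^{p})^2
         + \bar{V}^2 (v_0 - v_0^{p})^2 }
```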

Figure 1. Performance of the models on the existing chemical space of solvents with known coefficients. The red color indicates poor performance – model outliers.

As we can see from the figure, model outliers include: formamide, trifluoroethanol,
carbon disulfide, and DMSO. These solvents are on the outskirts of the chemical space.
In fact, we can clearly see that the model makes far better predictions for solvents
towards the center of the chemical space with particular success in predicting the
coefficients for series such as alkanes and alcohols. These observations should give
us caution when using the models to predict the solvent coefficients for novel solvents,
especially when they do not lie within the chemical space established by solvents
with known coefficients.

These Open Models (CC0) can be downloaded from the Open Notebook pages [29,31] and can
be used to predict the solvent coefficients for any organic solvent. The predictions can
serve either to estimate partition coefficients and other partitioning processes,
including solubilities via equation (1), or to find replacement and novel solvents for
current syntheses, recrystallization procedures, and other solvent-dependent processes [32].
We remind readers that solute solubility and partitioning are only two of the
considerations in finding an appropriate replacement solvent. Others include the
toxicity, purchase price, and disposal costs of the solvent, its physical properties,
and whether it undergoes any undesired chemical reactions with other compounds that
might be present in the solution. For example, some chemical reactions take place at
elevated temperatures, and here one would want a solvent with a boiling point high
enough that it does not vaporize under the experimental conditions.

Sustainable solvents

As an example of their application, we used our models to calculate the solvent
coefficients for a list of sustainable solvents from a paper by Moity et al. [33]. The
resulting coefficients for 119 select novel sustainable solvents are presented in Table 3.
A complete set of coefficients for all 293 solvents (sustainable, classic, and measured)
can be found in Additional file 2. These values should be used in light of the limitations
of the model described above: as possible starting places for further investigation, and
not as gospel.

Table 3. Predicted solvent coefficients for select sustainable solvents

By comparing the predicted solvent coefficients to those of solvents with measured
coefficients, we can make solvent replacement suggestions both in general and in particular.
In general, the distance between two solvents can be measured as the difference in their
predicted solubilities for the average compound.

(5)

(6)
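A form consistent with the description of this distance is sketched below. It assumes the zero-intercept Abraham model and writes the coefficients of solvents 1 and 2 with a second subscript. The bar notation for average solute descriptors and the absolute value are assumptions introduced here.

```latex
% Difference in predicted log P between solvents 1 and 2 for a solute (E, S, A, B, V)
\log P_1 - \log P_2 = (e_{0,1} - e_{0,2}) E + (s_{0,1} - s_{0,2}) S
                    + (a_{0,1} - a_{0,2}) A + (b_{0,1} - b_{0,2}) B
                    + (v_{0,1} - v_{0,2}) V

% Distance between the two solvents: the same difference for the average compound
d = \left| (e_{0,1} - e_{0,2}) \bar{E} + (s_{0,1} - s_{0,2}) \bar{S}
         + (a_{0,1} - a_{0,2}) \bar{A} + (b_{0,1} - b_{0,2}) \bar{B}
         + (v_{0,1} - v_{0,2}) \bar{V} \right|
```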

Using this method we found several possible replacements. For example, 1,2-propylene
glycol (e0 = 0.387, s0 = −0.447, a0 = 0.259, b0 = −3.447, v0 = 3.586) and methanol
(e0 = 0.312, s0 = −0.649, a0 = 0.330, b0 = −3.355, v0 = 3.691) have a d-value of 0.07.
This suggests that 1,2-propylene glycol may be a general sustainable solvent replacement
for methanol. To confirm our model’s suggestion, we compared the solubilities of compounds
from the Open Notebook Science Challenge solubility database [34] that had solubility
values for both 1,2-propylene glycol and methanol (see Figure 2).

Figure 2. Experimental solubilities in both methanol and 1,2-propylene glycol.

Examining Figure 2, we see that the solubility values are of the same order in most cases,
the biggest discrepancy being for dimethyl fumarate. The measured solubility values are
reported to be 0.182 M and 0.005 M for methanol and propylene glycol, respectively [34],
whereas the predicted solubilities are 0.174 M for methanol and 0.232 M for propylene
glycol, based upon the Abraham descriptors E = 0.292, S = 1.511, A = 0.000, B = 0.456,
V = 1.060 [35]. This suggests that the reported value for the solubility of dimethyl
fumarate in propylene glycol may be incorrect and that, in general, 1,2-propylene glycol
is a sustainable solvent replacement for methanol.

Other strongly suggested general replacements include: dimethyl adipate for hexane,
ethanol/water (50:50 by volume) for o-dichlorobenzene, and alpha-pinene for 1,1,1-trichloroethane.
Many more replacement suggestions can be generated by this technique.

In a similar manner to the above procedure for general solvent replacement for all
possible solutes, one can easily compare partition and solvation properties across
all solvents for a specific solute (or set of solutes) with known or predicted Abraham
descriptors (E, S, A, B, V). For example, using descriptors E = 0.730, S = 0.90, A = 0.59,
B = 0.40, V = 0.9317 for benzoic acid (and using d = 0.001), we can make several benzoic
acid-specific solvent replacement recommendations (see Table 4). These replacement
suggestions do not seem unreasonable chemically, and several examples can be explicitly
verified by comparing actual measured solubility values [34]. Such a procedure can easily
be carried out for other specific compounds with known or predicted Abraham descriptors
to find alternative green solvents in varying specific circumstances
(solubility, partition, etc.).
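The solute-specific comparison can be sketched in a few lines of Python. The coefficient and descriptor values are those quoted in the text for methanol, 1,2-propylene glycol, and benzoic acid. The zero-intercept Abraham form and the helper names `log_p` and `solvent_distance` are assumptions made for illustration, not the authors' implementation.

```python
# Predicted solvent coefficients (e0, s0, a0, b0, v0) as quoted in the text.
SOLVENTS = {
    "methanol": (0.312, -0.649, 0.330, -3.355, 3.691),
    "1,2-propylene glycol": (0.387, -0.447, 0.259, -3.447, 3.586),
}

# Abraham solute descriptors (E, S, A, B, V) for benzoic acid, from the text.
BENZOIC_ACID = (0.730, 0.90, 0.59, 0.40, 0.9317)

def log_p(coefficients, descriptors):
    """Assumed zero-intercept Abraham model: sum of coefficient * descriptor."""
    return sum(c * x for c, x in zip(coefficients, descriptors))

def solvent_distance(solvent_1, solvent_2, descriptors):
    """Absolute difference in predicted log P between two solvents for one solute."""
    return abs(log_p(SOLVENTS[solvent_1], descriptors)
               - log_p(SOLVENTS[solvent_2], descriptors))

d = solvent_distance("methanol", "1,2-propylene glycol", BENZOIC_ACID)
print(f"{d:.3f}")  # prints 0.060
```

Ranking every solvent in a coefficient table by this distance to a reference solvent, and keeping those below a chosen threshold, gives a candidate list of solute-specific replacements.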

Table 4. Replacement solvent suggestions for procedures involving benzoic acid

In addition to sustainable solvents, we also considered the list of commonly used
solvents in the pharmaceutical industry [36]. Of all the solvents listed, the only one
not covered previously by this work (Additional file 2) was 4-methylpent-3-en-2-one,
which has SMILES O=C(C=C(/C)C)C and predicted solvent coefficients e0 = 0.269,
s0 = −0.362, a0 = −0.610, b0 = −4.830, v0 = 4.240.