Molecular Properties: A Journey through Multiple Linear Regression
Let’s start with loading up all the libraries that I will use:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import networkx as nx
import seaborn as sns
from sklearn import preprocessing, tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import lightgbm
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
import keras
import warnings
warnings.filterwarnings('ignore')
And then let's load the data and look at some basic statistics:
df = pd.read_csv('molecule_complete.csv')
df.columns  # the names of all the columns

Index(['molecule_name', 'atom_index_0', 'atom_index_1', 'type',
       'scalar_coupling_constant', 'potential_energy', 'X', 'Y', 'Z',
       'XX_atom1', 'YX_atom1', 'ZX_atom1', 'XY_atom1', 'YY_atom1', 'ZY_atom1',
       'XZ_atom1', 'YZ_atom1', 'ZZ_atom1', 'XX', 'YX', 'ZX', 'XY', 'YY', 'ZY',
       'XZ', 'YZ', 'ZZ', 'mulliken_charge_atom1', 'mulliken_charge',
       'type_scc', 'fc', 'sd', 'pso', 'dso', 'atom_atom1_structure',
       'x_atom1_structure', 'y_atom1_structure', 'z_atom1_structure', 'atom',
       'x', 'y', 'z'],
      dtype='object')

df.select_dtypes(include=[object])  # this shows the categorical variables
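Beyond the column names, it helps to glance at numeric summaries before modeling. These are standard pandas calls that were not shown in the original:

# Quick numeric summary of every column (count, mean, std, min, quartiles, max)
df.describe()
# How many rows fall under each coupling type
df['type'].value_counts()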

Now to look at some visuals. This was thanks to Kaggle Competitor Andrew:
fig, ax = plt.subplots(figsize=(20, 12))
for i, t in enumerate(df['type'].unique()):
    df_type = df.loc[df['type'] == t]
    G = nx.from_pandas_edgelist(df_type, 'atom_index_0', 'atom_index_1', ['scalar_coupling_constant'])
    plt.subplot(2, 4, i + 1)
    nx.draw(G, with_labels=True)
    plt.title(f'Graph for type {t}')
This creates a network graph of each coupling type in relation to the scalar coupling constant. I also went ahead and created the same relational graphs for fc, mulliken_charge, pso, sd and dso, as they were also determined to be important features in my previous work; a sketch of that loop is shown below.
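Those extra graphs were not included in the post; a minimal sketch of how they could be generated, assuming the same edge-list approach as above, would be:

# Hypothetical sketch: repeat the type-wise network graph for the other important features
for feature in ['fc', 'mulliken_charge', 'pso', 'sd', 'dso']:
    fig, ax = plt.subplots(figsize=(20, 12))
    for i, t in enumerate(df['type'].unique()):
        df_type = df.loc[df['type'] == t]
        G = nx.from_pandas_edgelist(df_type, 'atom_index_0', 'atom_index_1', [feature])
        plt.subplot(2, 4, i + 1)
        nx.draw(G, with_labels=True)
        plt.title(f'Graph of {feature} for type {t}')
    plt.show()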

This can be a bit difficult to interpret. However, we see that some types are more clustered together, while others like 2JHH form a rather unique relation to the scalar coupling constant. There is a slightly better way to visualize the impact each type has on the target atom, using a Tableau visual:


Combining the above two pictures, we can understand why 2JHH has such a unique schema, and why types like 1JHC and 3JHC are similar and more clustered together. Now let's create a histogram of the scalar coupling constant and a violin plot of the constant by type:
#this was done for all important features from the LGBM feature extraction model, but I will only show scalar here.
fig, ax = plt.subplots(figsize = (18, 6))
plt.subplot(1, 2, 1);
plt.hist(df['scalar_coupling_constant'], bins=20);
plt.title('Basic scalar_coupling_constant histogram');
plt.subplot(1, 2, 2);
sns.violinplot(x='type', y='scalar_coupling_constant', data=df);
plt.title('Violinplot of scalar_coupling_constant by type');

After this, I encoded all categorical variables as numeric labels using LabelEncoder, created the feature and target sets, and finally ran train_test_split from scikit-learn.
for f in ['type', 'type_scc', 'atom_atom1_structure', 'atom']:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[f].values))
    df[f] = lbl.transform(list(df[f].values))

train = df.drop(['molecule_name', 'scalar_coupling_constant'], axis=1)
test = df['scalar_coupling_constant']

feature_train, feature_test, target_train, target_test = train_test_split(train, test, test_size=0.12)
This gave me the following:
total feature training samples: 4099169
total feature testing samples: 558978
total target training samples: 4099169
total target testing samples: 558978
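The post did not include the code behind these counts; a minimal sketch, assuming simple length checks, would be:

# Hypothetical reconstruction of the printout above
print('total feature training samples:', len(feature_train))
print('total feature testing samples:', len(feature_test))
print('total target training samples:', len(target_train))
print('total target testing samples:', len(target_test))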
In my previous models, I did not use any categorical variables, so I decided to run the LGBM model again, this time including the encoded categorical features:
train_data = lightgbm.Dataset(feature_train, label=target_train)
test_data = lightgbm.Dataset(feature_test, label=target_test)

# Create some parameters:
n_fold = 7
folds = KFold(n_splits=n_fold, shuffle=True, random_state=4)

parameters = {'num_leaves': 250,
              'min_child_samples': 75,
              'objective': 'regression',
              'max_depth': 24,
              'learning_rate': 0.001,
              'boosting_type': 'gbdt',
              'subsample_freq': 3,
              'subsample': 0.9,
              'bagging_seed': 15,
              'metric': 'mae',
              'verbosity': -1,
              'reg_alpha': 0.01,
              'reg_lambda': 0.3,
              'colsample_bytree': 1.0}

# Now to run the model:
model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=test_data,
                       num_boost_round=5000,  # I should have used 10,000 but 5000 was still great
                       early_stopping_rounds=100)

# Time to visualize the new feature importance change:
ax = lightgbm.plot_importance(model, max_num_features=40, figsize=(15, 15))
plt.show()

I made one more visual that focused on accuracy instead of mean absolute error; it gave a slightly different feature importance, but only the order of the top 5 features changed. Then I realized that since the target variable mostly contains rather small numbers while not all features do, the features with larger values were introducing a form of bias. So I decided to normalize the data:
normal_feature = df.drop(['molecule_name', 'scalar_coupling_constant'], axis=1)
target = df['scalar_coupling_constant']

x = normal_feature.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

# fit_transform returns an array instead of a dataframe. Now to convert this array back
# into a dataframe and change all column names back to the originals.
nf = pd.DataFrame(x_scaled)

# This reassigns the names. It would have been easier to write a loop, but since I did it
# once already, I just copy pasted and added the additional encoded variables in.
normfeat = nf.rename(columns={0: 'atom_index_0', 1: 'atom_index_1', 2: 'type',
                              3: 'potential_energy', 4: 'X', 5: 'Y', 6: 'Z',
                              7: 'XX_atom1', 8: 'YX_atom1', 9: 'ZX_atom1',
                              10: 'XY_atom1', 11: 'YY_atom1', 12: 'ZY_atom1',
                              13: 'XZ_atom1', 14: 'YZ_atom1', 15: 'ZZ_atom1',
                              16: 'XX', 17: 'YX', 18: 'ZX', 19: 'XY', 20: 'YY',
                              21: 'ZY', 22: 'XZ', 23: 'YZ', 24: 'ZZ',
                              25: 'mulliken_charge_atom1', 26: 'mulliken_charge',
                              27: 'type_scc', 28: 'fc', 29: 'sd', 30: 'pso',
                              31: 'dso', 32: 'atom_atom1_structure',
                              33: 'x_atom1_structure', 34: 'y_atom1_structure',
                              35: 'z_atom1_structure', 36: 'atom',
                              37: 'x', 38: 'y', 39: 'z'})
Now to create a new train-test split, since these are essentially new variables.
features = normfeat
target = df[['scalar_coupling_constant']]
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.1)

# Then convert it all to numpy arrays for easier calculations:
feature_array = feature_train.to_numpy()
target_array = target_train.to_numpy()
With that done, I moved on to the actual modeling. I did not run another LGBM model, as I did not think the feature importances would change much after normalizing, but you can try it out and let me know. I started with a basic linear regression model, scored with negative mean squared error. The closer the score is to zero, the better the result:
linear = LinearRegression()
mse = cross_val_score(linear, feature_array, target_array, scoring='neg_mean_squared_error', cv=20)
mean_mse = np.mean(mse)
# mean_mse: -6.683008610663098e-09
I created a new dataframe just for the target test variable for future comparison between the ridge and lasso models:
target_test = target_test.reset_index()
target_test = target_test.drop(['index'], axis=1)  # only run these two lines once or they will throw an error
target_test = target_test.rename(columns={'scalar_coupling_constant': 'Actual Test Scalar'})
Now to create a ridge model and test it against the normalized feature array. I chose GridSearchCV for both ridge and lasso, as it was the most optimal option based on my research: it uses cross validation instead of another train-test split, which minimizes the possibility of data leakage because the algorithm never sees the actual test set.
# The search grid for alpha (this is the param_grid visible in the GridSearchCV output below):
parameters = {'alpha': [1e-25, 1e-20, 1e-15, 1e-10, 1e-05, 0.01, 1, 10]}

ridge = Ridge()
ridgereg = GridSearchCV(ridge, param_grid=parameters, scoring='neg_mean_squared_error', cv=15)
ridgereg.fit(feature_array, target_array)
print('ridge param: ', ridgereg.best_params_)
print('ridge score: ', ridgereg.best_score_)
ridgereg

# The answers:
ridge param:  {'alpha': 1e-25}
ridge score:  -6.683015092064162e-09

GridSearchCV(cv=15, error_score='raise-deprecating',
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='warn', n_jobs=None,
             param_grid={'alpha': [1e-25, 1e-20, 1e-15, 1e-10, 1e-05, 0.01, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)

# Finally, refit the search on our training set with more folds:
ridgereg = GridSearchCV(ridge, param_grid=parameters, scoring='neg_mean_squared_error', cv=20)
ridgereg.fit(feature_array, target_array)

# To check the actual score:
ridgereg.score(feature_test, target_test)
# -6.836977895342504e-09  -- a very small mean squared error
Now to run some predictions against the held-out test set:
ridgepredict=ridgereg.predict(feature_test)
actualtest=np.array(target_test)
plt.rcParams["figure.figsize"] = (8, 8)
fig, ax = plt.subplots()
ax.scatter(actualtest, ridgepredict)
ax.set(title="Ridge Actual vs Predict")
ax.set(xlabel="Actual", ylabel="Predict");

This made me worry about overfitting, and I did not understand why there was a gap in the data. So I saved the ridge scores into a dataframe and moved on to lasso.
#To save the ridge data into a dataframe
ridgedata= pd.concat([target_test, pd.DataFrame(ridgepredict)], axis=1)
ridgecomparison= ridgedata.rename(columns={0:'Predicted Ridge Scalar'})
The Lasso was computed using the same method as Ridge:
lasso = Lasso()
lassoreg = GridSearchCV(lasso, param_grid=parameters, scoring='neg_mean_squared_error', cv=20)
lassoreg.fit(feature_array, target_array)

lassoreg.score(feature_test, target_test)
# -1.3148967532843286e-05  -- worse than ridge, which shows ridge suits this problem better than lasso

# To create the lasso graph:
lassopredict = lassoreg.predict(feature_test)
plt.rcParams["figure.figsize"] = (8, 8)
fig, ax = plt.subplots()
ax.scatter(actualtest, lassopredict)
ax.set(title="Lasso Actual vs Predict")
ax.set(xlabel="Actual", ylabel="Predict");

At this point, I saved the lasso scores into a dataframe and decided to directly compare both sets of results against the actual test scores, using Tableau as the visual medium:
lassodata = pd.concat([target_test, pd.DataFrame(lassopredict)], axis=1)
lassocomparison = lassodata.rename(columns={0: 'Predicted Lasso Scalar'})

# Combine the test dataframe with the ridge and lasso results, then save it as a CSV to check in Tableau
accuracy_check = pd.concat([ridgecomparison, lassocomparison], axis=1)
accuracy_check = accuracy_check.loc[:, ~accuracy_check.columns.duplicated()]  # this drops any duplicated columns
accuracy_check.to_csv('accuracy_check.csv', index=False)

# You can run the following to see if it saved properly:
acc_chk = pd.read_csv('accuracy_check.csv')
acc_chk.head()

This shows the accuracy of my ridge and lasso models. Since it is a bit tricky to check accuracy across three continuous columns, in Tableau I used the COUNT function on all columns as a filter and then created a SUM linear graph with the Actual Test Scalar as the main component. The point size shows where most of the values are concentrated; it does not really play a role in the bigger scheme of the project. However, a clear line of best fit can easily be drawn, showing that the algorithms I used can predict the molecular scalar coupling constant. This also helps explain the gaps: the majority of the scalar values were rather small numbers to begin with, and the few thousand that fell outside that boundary probably caused the gap within the linear graph.
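If you would rather quantify the fit in Python than eyeball the Tableau chart, a minimal sketch using the saved CSV and the column names above could look like this (the choice of metrics is my own, not from the original):

# Hypothetical check: how close each model's predictions are to the actual values
from sklearn.metrics import mean_absolute_error, r2_score

acc_chk = pd.read_csv('accuracy_check.csv')
actual = acc_chk['Actual Test Scalar']
for col in ['Predicted Ridge Scalar', 'Predicted Lasso Scalar']:
    mae = mean_absolute_error(actual, acc_chk[col])
    r2 = r2_score(actual, acc_chk[col])
    print(f'{col}: MAE = {mae:.6f}, R^2 = {r2:.6f}')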
There was one more algorithm I wanted to try out: a neural network. After trying out a few different ones in a separate notebook, I settled on ReLU activations with Adadelta as the optimizer. I chose Adadelta because of the explanation in the Keras documentation:
"…learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done."
earlystop = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0,
                                          mode='auto', baseline=None, restore_best_weights=False)
terminate = keras.callbacks.TerminateOnNaN()
adadelta = optimizers.Adadelta(learning_rate=1.0, rho=0.95)
After setting up these basic parameters, I created a model with early stopping and terminate-on-NaN enabled. Their descriptions and usage can be found in the Keras documentation.
m4 = Sequential()
m4.add(Dense(64, input_dim=40, activation='relu'))
m4.add(Dense(32, activation='relu'))
m4.add(Dense(16, activation='relu'))
m4.add(Dense(8, activation='relu'))
m4.add(Dense(4, activation='relu'))
m4.add(Dense(1, activation='linear'))
m4.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_19 (Dense)             (None, 64)                2624
_________________________________________________________________
dense_20 (Dense)             (None, 32)                2080
_________________________________________________________________
dense_21 (Dense)             (None, 16)                528
_________________________________________________________________
dense_22 (Dense)             (None, 8)                 136
_________________________________________________________________
dense_23 (Dense)             (None, 4)                 36
_________________________________________________________________
dense_24 (Dense)             (None, 1)                 5
=================================================================
Total params: 5,409
Trainable params: 5,409
Non-trainable params: 0

# Compile and fit with the early stopping callbacks
m4.compile(optimizer=adadelta, loss='mean_squared_error', metrics=['mae', 'acc'])
his4 = m4.fit(feature_array, target_array, validation_split=0.3, verbose=1,
              callbacks=[earlystop, terminate], epochs=300, batch_size=5000)
Because of the early stopping and terminate callbacks, the model ends rather quickly. It shows a high mae and low accuracy, as it stops at what looks like early convergence. But I knew this was a mistake because of the graph below.
#with early and termination callback established
acc = his4.history['acc']
val_acc = his4.history['val_acc']
loss = his4.history['loss']
val_loss = his4.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend()
plt.show()

Then I ran a new model, but this time without the early stopping or terminate callbacks; I wanted to run through all the epochs:
m4.compile(optimizer=adadelta, loss='mean_squared_error', metrics=['mae', 'acc'])
his4 = m4.fit(feature_array, target_array, validation_split=0.3, verbose=1, epochs=300, batch_size=5000)

# The graph
acc = his4.history['acc']
val_acc = his4.history['val_acc']
loss = his4.history['loss']
val_loss = his4.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend()
plt.show()

Now I can see that both accuracy and loss have converged. However, accuracy converges at a very low value: with a continuous target, predictions almost never match exactly, so accuracy stays tiny, provides no proper gradient to follow, and is simply not an appropriate metric for this model. I decided to look at mean absolute error instead:
mae = his4.history['mae']
val_mae = his4.history['val_mae']
loss = his4.history['loss']
val_loss = his4.history['val_loss']
epochs = range(len(mae))
plt.plot(epochs, mae, 'r', label='Training mae')
plt.plot(epochs, val_mae, 'b', label='Validation mae')
plt.title('Training and validation mae')
plt.ylabel('mae')
plt.xlabel('epoch')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend()
plt.show()

This shows much better convergence and gradient descent. If I had run for a higher epoch count, the mae would most likely have decreased further. At this point, the mae was at 0.6403 with a loss of 0.6871. Although it was not as small as the ridge model's error, it was still a decent model. However, in terms of quantitative analysis, the ridge model is still the best choice for linear regression problems with many features and millions of rows per feature. In addition, each feature's values were sparse, and some features may not play any role at all. There is also the chance that some features have the same effect on the target variable, leading to multicollinearity and differences in weight importance that can cause bias and overfitting in your model. Taking all these factors into consideration, I believe that using a ridge model and judging it by mean error rather than accuracy is perfect for this challenge. My GitHub URL for this project is linked below, so you can take a look and perhaps improve my model.
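If you want to check the multicollinearity claim yourself, one quick diagnostic is a pairwise correlation scan of the normalized features. A minimal sketch; the 0.9 cutoff is my own assumption, not from the original:

# Hypothetical check: list highly correlated feature pairs as a multicollinearity red flag
corr = normfeat.corr().abs()
upper = np.triu(np.ones(corr.shape), k=1).astype(bool)  # keep each pair once, skip the diagonal
high_corr = corr.where(upper).stack().sort_values(ascending=False)
print(high_corr[high_corr > 0.9])  # pairs above 0.9 are likely redundant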