Conformal Prediction with Parameter Optimization Loop

I haven't updated for a while, but I got conformal prediction up and running.

f:id:hateknime:20200906151703p:plain

While making this workflow, I found a part of KNIME that was quite difficult to overcome.

f:id:hateknime:20200906151734p:plain

A loop inside another loop creates a flow variable with the same name, in this case "iteration".

f:id:hateknime:20200906151808p:plain

So, left as it was, the loop stopped with a "same flow variable" error. I ended up passing the flow variable from the conformal prediction loop on to the parameter optimization loop. This is not the nicest way to do it... I wonder if there is a "rename flow variable" kind of node or some other way, but Googling suggested there is none.

 

Never mind; conformal prediction here gave a nice result, flagging some values as unpredictable, which is a first among the prediction workflows I've built. However, the dataset was distributed roughly 1:1, so it didn't make such a large difference. I would imagine that with a skewed target dataset, this could be a nice way to predict.
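For context, here is a minimal numpy sketch of how inductive conformal prediction produces those "unpredictable" cases. This is just the general idea, not what the KNIME conformal nodes do internally; the error level epsilon and the nonconformity measure are my own choices.

```python
import numpy as np

def conformal_sets(calib_probs, calib_labels, test_probs, epsilon=0.2):
    """Inductive (Mondrian) conformal prediction for a binary classifier.
    Nonconformity = 1 - predicted probability of the class.
    A class enters the prediction set if its p-value >= epsilon;
    an empty set means 'unpredictable' at this error level."""
    sets = []
    for probs in test_probs:
        included = []
        for cls in (0, 1):
            # class-conditional calibration scores
            cal_scores = 1.0 - calib_probs[calib_labels == cls, cls]
            test_score = 1.0 - probs[cls]
            p_value = (np.sum(cal_scores >= test_score) + 1) / (len(cal_scores) + 1)
            if p_value >= epsilon:
                included.append(cls)
        sets.append(included)
    return sets

# toy usage: calibration probabilities/labels and two test samples
calib_p = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
calib_y = np.array([0, 1, 0, 1])
print(conformal_sets(calib_p, calib_y, np.array([[0.95, 0.05], [0.5, 0.5]])))
```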

 

Also, there have been many updates in KNIME recently. I hope Graph Convolution comes out soon too!

Feature selection in KNIME

There are many ways to select features for model building. Previously I posted about using random forest importance.

hateknime.hatenablog.com

 

This time I used the "Feature Selection Loop".

f:id:hateknime:20200708175009p:plain

Using this loop, you can do forward selection, backward selection, random selection, AND GENETIC ALGORITHM selection. What!? GA??? Was this already there in KNIME ver. 3?

GA is one of my favourite selection techniques (not saying it's the best, though) because it seems to pick features different from the other techniques. So here are the options.

f:id:hateknime:20200708175159p:plain

In the middle, you just need to select Genetic Algorithm as the Feature selection strategy. I may use the upper bound on the number of features too, depending on the sample size, I guess.
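To make the idea concrete, here is a toy Python sketch of GA feature selection scored by cross-validated Cohen's kappa. It only illustrates the strategy, not how the Feature Selection Loop implements it; the population size, mutation rate, and classifier are my own choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, generations=20, pop_size=16, mut_rate=0.05, seed=0):
    """Toy genetic-algorithm feature selection scored by Cohen's kappa."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    kappa = make_scorer(cohen_kappa_score)

    def fitness(mask):
        if not mask.any():
            return -1.0
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        return cross_val_score(clf, X[:, mask], y, cv=3, scoring=kappa).mean()

    pop = rng.random((pop_size, n_feat)) < 0.5               # random initial feature masks
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)][-pop_size // 2:]   # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                    # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feat) < mut_rate           # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m) for m in pop])
    return pop[np.argmax(scores)], scores.max()              # best mask and its kappa
```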

I let it run on my AMES data and got this (sorted by Cohen's kappa).

f:id:hateknime:20200708175840p:plain

So it shows, for each round, what the score was as well as the features used. Sweet, I can probably play around to see which features contribute more and/or just use the best model!

 

Multi-task neural network in KNIME using a custom loss function

In many cases, some of your target values will be missing. That is, there are Yes and No values but also 'I don't know' values (hence null). Following the trend of neural networks and deep learning, I wanted to try multi-task learning to see how well (or not) it performs.

 

One way to overcome this is to replace the null values with some other value and exclude them from the loss calculation (mask them). So here is what I did in KNIME.

f:id:hateknime:20200607182612p:plain

 

This is very similar to my previous NN workflow, but here is the addition.

f:id:hateknime:20200607182637p:plain

Mask the null values (? in KNIME) with -10,

then make a custom loss function like this.

f:id:hateknime:20200607182710p:plain
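For reference, a masking loss of this kind could look roughly like the Keras/TensorFlow sketch below. This is my own sketch of the idea, not the exact snippet used in the Keras node; the mask value -10 matches the replacement above.

```python
import tensorflow as tf

MASK_VALUE = -10.0  # placeholder used above for missing (null / '?') targets

def masked_binary_crossentropy(y_true, y_pred):
    # 1 where a real label exists, 0 where the target was masked
    mask = tf.cast(tf.not_equal(y_true, MASK_VALUE), y_pred.dtype)
    # zero out masked targets so the per-element loss is well defined
    bce = tf.keras.backend.binary_crossentropy(y_true * mask, y_pred)
    # average the loss over the observed (unmasked) entries only
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```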

Very easy, but fun to see what happens. I used a toxicity database from MoleculeNet and found that the data is very skewed (lots of NOs and few YESs), hence not much difference between single-task and multi-task learning. Hmmm, really in need of a good dataset, but I guess a real dataset will probably do.

 

This was so much fun to test and see that it actually works (no improvement, but it learns at least). I wonder what happens with real and different datasets.

 

I've also been looking into conformal prediction lately. The KNIME online seminar was awesome, and they already have the nodes as well. Such a nice program.

R-group viewing using KNIME

Previously I had a first go at R-group analysis using KNIME and got a nice table.

hateknime.hatenablog.com

But I didn't go into detail about viewing the results in KNIME because I didn't know how to turn SMILES into nice images in the final tables (so I used other commercial tools from there). A few days ago though, there was a really nice talk at the KNIME online summit on using RDKit.

https://www.knime.com/about/events/extended-knime-spring-summit-online-apr-2020

 

There, I found nice code using a Java Snippet node for turning SMILES into URL-encoded images. I won't put it up here. Samples are available too, so I think it will be really nice to try out the R-group analysis workflow yourself (SO SO SO much better than what I put up previously).
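Not the summit's Java Snippet code, but just to illustrate the idea, a Python/RDKit sketch that turns SMILES into an inline base64 PNG (data URI) might look like this; the image size and helper name are my own choices.

```python
import base64
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

def smiles_to_data_uri(smiles, width=250, height=200):
    """Render a SMILES string as a PNG and return it as a data URI."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    drawer = rdMolDraw2D.MolDraw2DCairo(width, height)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    png = drawer.GetDrawingText()   # raw PNG bytes
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")

print(smiles_to_data_uri("c1ccccc1O")[:60])  # phenol, truncated for display
```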

The many docs, samples, and code examples available in the KNIME community are so useful!

Getting sample datasets from DeepChem/PDBbind to play around from home

Ok, stay home, play around from home...
NO DATASET TO TRAIN!!! AHHHHH!!!!

 

Then comes the nice DeepChem collection with a variety of sample datasets. Or there is BindingDB and so on.
https://github.com/deepchem/deepchem#citing-deepchem
http://www.bindingdb.org/bind/index.jsp 

 

I like how it (DeepChem) has PDBbind (not the latest, I think),
http://www.pdbbind-cn.org/
and I wanted to extract data for a certain type of protein. So here is the workflow.

f:id:hateknime:20200421135538p:plain

 

I also like how the Fixed Width File Reader lets you specify the width of each column.
In the top middle part, with the Value Counter and Sorter, I looked at which proteins have the most data. Beta-secretase 1 seems to have many, so I extracted those PDBs by copying the matching entries from the whole list of the PDB dataset (in the middle looping section). String Manipulation is where I built the file URLs to get the files, like this!

f:id:hateknime:20200421135641p:plain
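Roughly, the same index-filtering and URL-building steps could be sketched in Python like this. The index file name, column offsets, and local paths are assumptions; check them against the actual PDBbind files you downloaded.

```python
from collections import Counter

# assumed index file name and fixed-width layout -- adjust to your download
index_file = "INDEX_general_PL_name.2018"
rows = []
with open(index_file) as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        pdb_id = line[:4]
        protein_name = line[22:].strip()  # rough offset, like the Fixed Width File Reader columns
        rows.append((pdb_id, protein_name))

# which proteins have the most entries? (the Value Counter + Sorter step)
print(Counter(name for _, name in rows).most_common(10))

# build file URLs for one target of interest (the String Manipulation step)
target = "BETA-SECRETASE 1"
urls = [f"file:///data/pdbbind/{pdb}/{pdb}_ligand.sdf"
        for pdb, name in rows if target in name.upper()]
```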

 

Then I obtained their Kd values in the bottom row of the workflow.

f:id:hateknime:20200421140151p:plain

Sweet dataset! Machine learning on this dataset may be fun. I also wondered how to treat Kd, IC50, Ki and so on, so I googled and found the same question.
https://github.com/deepchem/deepchem/issues/715

Treat them the same to start off with, but use them with caution, I guess.
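One common starting point is to put everything on a negative log molar scale. A small sketch of that conversion (my own, with a simplified string format) could look like:

```python
import math
import re

UNIT_TO_MOLAR = {"mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12, "fM": 1e-15}

def to_neg_log_molar(affinity):
    """Convert strings like 'Kd=49uM' or 'IC50=3.2nM' to -log10(molar).
    Putting Kd, Ki and IC50 on one scale is the 'treat them the same' simplification."""
    m = re.match(r"(Kd|Ki|IC50)[=<>~]+([\d.]+)(mM|uM|nM|pM|fM)", affinity)
    if not m:
        return None
    return -math.log10(float(m.group(2)) * UNIT_TO_MOLAR[m.group(3)])

print(to_neg_log_molar("Kd=49uM"))   # about 4.31
```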

 

 

Active Learning using Density as well

Previously I posted an active learning example.

hateknime.hatenablog.com

 

and said I was going to try it with a density label, so here it is. It took a while, with all sorts of problems, but I got it in the end.

f:id:hateknime:20200412175535p:plain

 

I plotted what happens with the AMES dataset.

f:id:hateknime:20200412175620p:plain

 

A steep rise (y-axis is accuracy, x-axis is rounds of adding 10 samples each), then a linear rise up to 100% accuracy (well, overfit...).
So it looks like around 0.80, or maybe a bit more, is the target for actual model usage. I should add a validation dataset so I can actually measure the true accuracy, but this dataset is difficult in the sense that it is actually quite sparse. Real-life datasets will definitely be interesting and something to try at work.

I also really like the Exploration/Exploitation Score Combiner; the (1-x) option for exploration saves a Math Formula node! Haha!
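The idea of mixing the two scores can be sketched in a few lines of Python. The 0-1 rescaling and the beta weight are my own assumptions, not exactly what the Score Combiner node does.

```python
import numpy as np

def combined_score(uncertainty, density, beta=0.5):
    """Weighted mix of exploitation (model uncertainty) and exploration (data density).
    Both scores are rescaled to 0-1 first; beta and the scaling are my own choices."""
    u = (uncertainty - uncertainty.min()) / (np.ptp(uncertainty) + 1e-12)
    d = (density - density.min()) / (np.ptp(density) + 1e-12)
    return beta * u + (1.0 - beta) * d

# pick the 10 highest-scoring unlabeled samples for the next labeling round
rng = np.random.default_rng(0)
scores = combined_score(rng.random(100), rng.random(100))
next_batch = np.argsort(scores)[-10:]
```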

 

As for clustering lots of data, I'm unfortunately hitting a limit here, so I'm trying a GPU-assisted approach and other methods using Python. I will probably post how that goes too.

 

Data science/chemistry is so much fun! But then I do think molecular simulation is another dimension of fun too!!

 

Simple Ensemble Model using previous KNIME workflows

I previously posted about building Random Forest, XGBoost, and SVM models.

hateknime.hatenablog.com

 

 

So how about combining them all together? Here is what I did.

f:id:hateknime:20200330220510p:plain

The top part is all the learning, which can be copied and pasted from previous posts. I added a model-saving node at the end because I wanted to be able to stop before going to the bottom part. The bottom part adds up the active/inactive probabilities given by each model and decides the class from the sum of the three models using a Math Formula node.
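The bottom part boils down to something like this Python sketch; the 1.5 cut-off (i.e. an average probability of 0.5) is my own assumption for the Math Formula step.

```python
import numpy as np

def ensemble_predict(p_rf, p_xgb, p_svm, threshold=1.5):
    """Sum the three 'active' probabilities and apply a simple cut-off,
    mirroring the Math Formula step; the 1.5 threshold is an assumption."""
    total = p_rf + p_xgb + p_svm   # each probability in [0, 1]
    return np.where(total >= threshold, "active", "inactive")

print(ensemble_predict(np.array([0.8, 0.2]), np.array([0.7, 0.4]), np.array([0.9, 0.1])))
# -> ['active' 'inactive']
```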

 

Very simple and easy to make using KNIME again. Kaggle's ensemble models take like hundreds of models, so this is just an easy integration. But things are getting a bit complex here. In that case, using a "Metanode" helps. Simply choose the nodes to group, right-click, and make them into a metanode. To get this!

f:id:hateknime:20200330220521p:plain

Programming is all about input and output (IO), and these metanodes not only tidy up the space but also let you see the IO more easily. Another great option from KNIME!


 

Oh, the result: accuracy was 0.71 for Random Forest, 0.78 for XGBoost, 0.79 for SVM, and 0.80 for the ensemble. Wait, that's quite different from the previous results... Partitioning, maybe? But why is Random Forest so low...