Regarding fitting of data sampled at irregular intervals

Post by **shailesh** » Wed Jun 11, 2014 6:50 am

I have two queries:

1> If the data (voltage vs time) to be fitted is sampled at irregular intervals, does MRF while evaluating the fitness consider only the points provided (loaded from file) or does it implicitly perform a linear interpolation between the data points - and then use all these values for comparison?

2> I created a model of a cell with the HH mechanism. Its gnabar, gkbar, gl and el were tweaked to get a slightly different shape. The voltage vs time plots were recorded for both fixed step and adaptive integration. I tried fitting data from these two cases (to a model with default HH) using MRF and found that the parameter sets turned out to be quite different! I understand that some differences arise between the plots from the two cases, and thus different parameter sets might result. But considering that the difference is huge, how should we decide which method (fixed/adaptive) to use when performing MRF? This might become more important when handling experimental data (such as data digitized from figures).

Post by **ted** » Wed Jun 11, 2014 11:48 am

shailesh wrote: If the data (voltage vs time) to be fitted is sampled at irregular intervals, does MRF while evaluating the fitness consider only the points provided (loaded from file) or does it implicitly perform a linear interpolation between the data points - and then use all these values for comparison?

Good question. Here's one way to find out:
1. create a toy model (e.g. a passive single compartment with known membrane time constant, driven by a current step)
2. use that to generate a set of test data sampled at irregular intervals (no more than 4 or 5 points--keep it simple)
3. set up a "run fitness" optimization problem in which g_pas and cm are to be adjusted so that the simulated v vs. t (generated with fixed dt so that a run involves dozens or hundreds of time steps) matches the (irregularly sampled) time course of v vs. t
4. before doing the optimization, click on the generator's "Error" button, see what value it reports, and compare that against the sum of squared errors you would expect if the errors were evaluated only at the 4 or 5 sample points.
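
For step 4, the "expected" value can be worked out outside NEURON. The following is a minimal sketch in Python (not hoc), assuming the analytic charging curve of a passive compartment stands in for both the "data" and the simulated run. The sample times, time constants, and the `step_response` helper are hypothetical and chosen only for illustration.

```python
import math

def step_response(t, tau_ms, v_rest=-70.0, dv_inf=10.0):
    """Analytic response of a passive single compartment to a current step (mV).

    dv_inf is the steady-state depolarization; all values are illustrative.
    """
    return v_rest + dv_inf * (1.0 - math.exp(-t / tau_ms))

# Hypothetical irregular sample times (ms) and time constants (ms).
t_samples = [0.5, 2.0, 3.0, 7.5, 20.0]
tau_data, tau_model = 10.0, 15.0

data = [step_response(t, tau_data) for t in t_samples]
model = [step_response(t, tau_model) for t in t_samples]

# Sum of squared errors evaluated only at the sample times.
sse = sum((m - d) ** 2 for m, d in zip(model, data))
print(f"SSE at {len(t_samples)} sample points = {sse:.4f}")
```

If the value reported by the "Error" button differs from this sum, the difference hints at interpolation, normalization, or weighting inside the generator.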

shailesh wrote: I created a model of a cell with the HH mechanism. Its gnabar, gkbar, gl and el were tweaked to get a slightly different shape. The voltage vs time plots were recorded for both fixed step and adaptive integration. I tried fitting data from these two cases (to a model with default HH) using MRF and found that the parameter sets turned out to be quite different! I understand that some differences arise between the plots from the two cases, and thus different parameter sets might result. But considering that the difference is huge, how should we decide which method (fixed/adaptive) to use when performing MRF? This might become more important when handling experimental data (such as data digitized from figures).

Multiple interesting questions here. The answer to all of them depends in part on the answer to your first question--it may be as simple as "if the 'experimental data' were sampled at a particular set of times, the model's output must be sampled at exactly the same times." If that is the case, you may want to resample irregularly sampled "experimental data" at a fixed interval, e.g. by using the Vector class's interpolate() method or perhaps by implementing a resampling strategy that uses low order polynomials or splines. Simulation results generated by adaptive integration can be captured at regular intervals by using the Vector class's record(&var, Dt) or record(&var, tvec) syntax (see the Programmer's reference about these features); if you resort to that, you'll have to provide your own error function that makes use of the regularly sampled simulation results.
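
The resampling idea can be illustrated outside NEURON. Below is a minimal Python sketch, assuming plain linear interpolation onto a fixed-interval grid. The functions `interp_linear` and `resample` are hypothetical helpers written for this example, not NEURON's Vector methods, and the sample values are made up.

```python
from bisect import bisect_right

def interp_linear(x, xs, ys):
    """Linearly interpolate (xs, ys) at x; xs must be strictly increasing.

    Values outside [xs[0], xs[-1]] are clamped to the end points.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])

def resample(xs, ys, dt):
    """Resample irregularly spaced (xs, ys) onto a grid with spacing dt."""
    n = int((xs[-1] - xs[0]) / dt + 1e-9) + 1
    grid = [xs[0] + k * dt for k in range(n)]
    return grid, [interp_linear(x, xs, ys) for x in grid]

# Made-up irregular samples: time (ms) and voltage (mV).
t_irregular = [0.0, 1.0, 4.0]
v_irregular = [0.0, 10.0, 40.0]
t_uniform, v_uniform = resample(t_irregular, v_irregular, dt=1.0)
print(list(zip(t_uniform, v_uniform)))
```

Linear interpolation is the simplest choice; for sparsely sampled action potentials it flattens peaks, which is one reason to consider splines or low order polynomials instead.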

Post by **shailesh** » Thu Jun 12, 2014 5:17 am

Regarding the first question:
> I followed your suggestion and set up a similar toy model (preferred 'hh' instead of 'pas' as I had it ready). The results were interesting but left me with further queries!
The data to be fitted was a shape of an AP (with tweaked values of HH parameters) with just 3 points (just before onset, peak and after-hyperpolarization). The model (with default HH parameters) was set to run using fixed step integration. On clicking "Error Value", the following was observed:
- The voltage vs time graph plotted the entire continuous waveform for the simulated run
- The MRF Generator window originally had just a red plot (with three points) showing the data to be fitted (loaded from file). After the run, it plotted a waveform in black with just three points - and these points corresponded with the (voltage,time) value pairs on the voltage graph.
- The Error Value shown was 403.87
- But when I manually calculate the sum of squared errors, I get 1211.6121

Points (t, vm) from File:
5 -76.2631
5.725 43.6974
7.625 -76.8252

Points (t, vm) from Simulation
5 -64.9492
5.725 40.9602
7.625 -44.021

I am sure that if anything like interpolation was being done then certainly there would be more points and thus a higher Error Value (more terms in sum of squared errors). So I am left wondering how MRF arrived at the value of 403.87?! Quite sure I haven't goofed up, but one can never be certain I suppose.

Post by **ted** » Thu Jun 12, 2014 1:16 pm

Interesting. The three points would be connected by straight line segments, right? Try limiting the interval over which the generator evaluates the error. Try to isolate a region that lies _between_ two of your "original data" points (if you're not sure how, see this part of the MRF tutorial http://www.neuron.yale.edu/neuron/stati ... imize.html), then evaluate the error over this region. If you get a value other than 0, then maybe the data are being interpolated. I wonder if you can reduce the interval width to where it contains only about 2 or 3 solution points, which would allow you to make a quick "manual" calculation that confirms or rules out interpolation.

Post by **shailesh** » Thu Jun 12, 2014 3:01 pm

Yes, the three points are connected by straight line segments.
I tried what you suggested about restricting regions. Firstly, I found that it was not possible to position the blue lines so that they lay between two of our points of interest. I even tried using the weight panel and entering the startpoint/endpoint to achieve the same. They would either leave at least one point in between or snap on to each other.

So I continued testing for restricted regions that it allowed. For all three points individually it worked fine with the MRF Error Values being same as the calculated values:
Point 1: 128
Point 2: 7.4921
Point 3: 1076.1
All is well. In these cases, the model does not plot anything on the MRF Generator graph, as only one point is under consideration.

But now when I try including multiple points, the trouble arises.
Points 1 & 2 -> MRF = 67.748 vs Calculated = 135.4921
Points 2 & 3 -> MRF = 541.8 vs Calculated = 1083.5921
Points 1 & 3 -> MRF = 389.86 vs Calculated = 1204.1
Points 1, 2 & 3 -> MRF = 403.87 vs Calculated = 1211.5921
(approx values). In these cases we do have straight line segments joining the concerned points.

It is weird that the sum of squared errors is less for all 3 points together than some combinations of just two points. Not sure what is happening...

Post by **hines** » Wed Jun 18, 2014 7:47 am

Since you mention "regions" I infer that we are talking about nrn/lib/hoc/mulfit/e_norm.hoc with the comment at the beginning of:

error is weighted sum of normalized error in each region. Region i
has weight[i] and domain boundary[i] < x < boundary[i+1].
Normalized region error is an approximation to the integral of
(y(x) - ydat(x))^2 over the integral of x.

In func efun() in that file, for model and data that are on independent non-uniform grids, the calculation used is

e = ydat_.meansqerr($o1.interpolate(xdat_,$o2), dw_)

Here, $o1 is the model trajectory y values and $o2 is the model trajectory t values.
ydat and xdat are the data y and t values. So the model trajectory is interpolated to the data trajectory and only the resulting values at the data locations are used in the "meansqerr" calculation which is defined as "return value is sum of w*(v1 - v2)^2 / size".

dw_ depends on the region sizes and intervals between data points and is implemented in the above file in the set_w() procedure. That is certainly complicated and the implementation goal is to make the first comment of this reply, true. Let's leave this as an open question in lieu of further code review and testing and see if the above is sufficient for you to resolve your test result differences.
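
The calculation described above can be sketched in Python (not hoc). The helper names `interpolate_at` and `mean_sq_err` are hypothetical stand-ins for the interpolate and meansqerr steps named in this post. The weights are set to 1 as an assumption, because the actual dw_ values come from set_w() and are not reproduced here. The model and data values are made up.

```python
from bisect import bisect_right

def interpolate_at(x_new, xs, ys):
    """Linearly interpolate the trajectory (xs, ys) at each point of x_new.

    Assumes xs is strictly increasing and x_new lies within [xs[0], xs[-1]].
    """
    out = []
    for x in x_new:
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        frac = (x - xs[i]) / (xs[i + 1] - xs[i])
        out.append(ys[i] + frac * (ys[i + 1] - ys[i]))
    return out

def mean_sq_err(v1, v2, w):
    """Sum of w*(v1 - v2)^2 divided by the number of points."""
    return sum(wi * (a - b) ** 2 for a, b, wi in zip(v1, v2, w)) / len(v1)

# Made-up model trajectory on its own grid (t in ms, y in mV).
t_model = [0.0, 1.0, 2.0, 3.0, 4.0]
y_model = [0.0, 2.0, 4.0, 6.0, 8.0]

# Made-up data on an independent, irregular grid.
t_data = [0.5, 1.25, 3.5]
y_data = [2.0, 2.5, 6.0]

# Only the model values at the data times enter the error.
model_at_data = interpolate_at(t_data, t_model, y_model)
weights = [1.0] * len(t_data)  # assumption: unit weights, not the real dw_
err = mean_sq_err(y_data, model_at_data, weights)
print(model_at_data, err)
```

Note that the number of terms in the sum equals the number of data points, however finely the model trajectory is sampled.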

I should mention that I believe the entire notion of mean square model-data trajectory difference as a fitness function can certainly be criticised, especially for action potentials. To me, there seem to be two criteria for a fitness function for the praxis method: 1) from a reasonable starting set of parameters, there must be a path to the minimum which is all downhill; 2) the minimum is meaningful in terms of one's judgment of what constitutes a reasonable fit of model to data.

Post by **shailesh** » Wed Jun 18, 2014 8:06 am

Sorry, I wasn't sure what you meant by:

... to make the first comment of this reply, true.

and so wanted to clarify.

Was it regarding my first question on this thread:

1> If the data (voltage vs time) to be fitted is sampled at irregular intervals, does MRF while evaluating the fitness consider only the points provided (loaded from file) or does it implicitly perform a linear interpolation between the data points - and then use all these values for comparison?

and that, yes, MRF does indeed interpolate between the provided points to evaluate the error value?

Post by **hines** » Wed Jun 18, 2014 8:57 am

I was referring to my comment:

error is weighted sum of normalized error in each region. Region i
has weight[i] and domain boundary[i] < x < boundary[i+1].
Normalized region error is an approximation to the integral of
(y(x) - ydat(x))^2 over the integral of x.

I have been looking at the implementation of the set_w() procedure and it appears that the comment is, in fact, false. From the point of view of integration, the implementation implicitly assumes the data in each region is at uniform intervals since the weight of each data point within each region is constant. Each y(x) - ydat(x) at the x data values is given a weight proportional to the weight of the region it is in.

Post by **shailesh** » Wed Jun 18, 2014 2:02 pm

Based on your comments:

the model trajectory is interpolated to the data trajectory and only the resulting values at the data locations are used in the "meansqerr" calculation which is defined as "return value is sum of w*(v1 - v2)^2 / size".

and

error is weighted sum of normalized error in each region

I took a second look at the values I posted earlier:

Point 1: 128 (same as calculated)
Point 2: 7.4921 (same as calculated)
Point 3: 1076.1 (same as calculated)

Points 1 & 2 -> MRF = 67.748 vs Calculated = 135.4921
Points 2 & 3 -> MRF = 541.8 vs Calculated = 1083.5921
Points 1 & 3 -> MRF = 389.86 vs Calculated = 1204.1
Points 1, 2 & 3 -> MRF = 403.87 vs Calculated = 1211.5921

I had missed the 'size' earlier, and incorporating that in calculating the error sum of squares I got:

Points 1 & 2 -> MRF = 67.748 vs Calculated = 135.4921/2 = 67.746
Points 2 & 3 -> MRF = 541.8 vs Calculated = 1083.5921/2 = 541.796
Points 1, 2 & 3 -> MRF = 403.87 vs Calculated = 1211.5921/3 = 403.864

... and the values match! It should be noted that all the above involved a single region with multiple points.

The only exception is:

Points 1 & 3 -> MRF = 389.86

This involves individual points in two different regions. We find a similar situation when three regions are defined with one point in each region:
Point 1 (region 1, weight 1), Point 2 (region 2, weight 1) & Point 3 (region 3, weight 1) -> MRF = 326.16 (Total weight 1)

These have to be evaluated as:
Error Value = ( (w1*e1/s1) + (w2*e2/s2) + ... + (wN*eN/sN) ) / ( (w1/s1) + (w2/s2) + ... + (wN/sN) )
where
wX : weight assigned to interval (region) #X
eX : error sum of squares obtained in interval (region) #X
sX : size of interval (region) #X (in ms)
N: number of intervals (regions)

The Total weight (scale) simply multiplies the above Error Value to give the final Error Value of the fitness function. This is useful when we have multiple generators and want to adjust the relative contribution (weighting) of each generator to the overall MRF optimization.

As an example:
> Region 1 (5 < t < 5.3625) -> s1 = 0.3625
Weight = 1 = w1
Point 1 @ t = 5, Error sum of squares (-64.9492 vs -76.2631) = 128 = e1

> Region 2 (5.3625 < t < 6.675) -> s2 = 1.3125
Weight = 1 = w2
Point 2 @ t = 5.725, Error sum of squares (40.9602 vs 43.6974) = 7.4921 = e2

> Region 3 (6.675 < t < 7.625) -> s3 = 0.95
Weight = 1 = w3
Point 3 @ t = 7.625, Error sum of squares (-44.021 vs -76.8252) = 1076.1 = e3

From the formula, we get Error Value = 326.15
Total weight (scale) = 1, So the Error Value returned by the generator = 326.15 x 1 = 326.15 ... which matches the earlier mentioned value!
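
The arithmetic of this example can be checked with a short Python sketch of the formula as stated in this post. The function name `region_weighted_error` is a hypothetical label for illustration; it encodes the formula inferred above, not code taken from e_norm.hoc.

```python
def region_weighted_error(weights, errors, sizes, scale=1.0):
    """Error Value = scale * sum(w*e/s) / sum(w/s), summed over regions.

    weights: weight of each region
    errors:  error sum of squares obtained in each region
    sizes:   width of each region (ms)
    scale:   the generator's Total weight
    """
    numerator = sum(w * e / s for w, e, s in zip(weights, errors, sizes))
    denominator = sum(w / s for w, s in zip(weights, sizes))
    return scale * numerator / denominator

# Values from the three-region example in this post.
w = [1.0, 1.0, 1.0]
e = [128.0, 7.4921, 1076.1]
s = [0.3625, 1.3125, 0.95]

value = region_weighted_error(w, e, s, scale=1.0)
print(round(value, 2))  # about 326.15
```

Because each term is divided by the region width, a point in a narrow region contributes more than the same squared error in a wide region.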

It took me quite a while to figure that out, but once obtained it seemed so straightforward and - should I say - obvious!

Post by **shailesh** » Thu Jun 19, 2014 8:28 am

So in context to my first question on this thread:

... does MRF while evaluating the fitness consider only the points provided (loaded from file) or does it implicitly perform a linear interpolation between the data points - and then use all these values for comparison?

I suppose we can summarize that the model trajectory is interpolated to the data trajectory and only the resulting values at the data locations are used in calculating the error value, i.e. the number of points at which the error is evaluated = the number of points provided in the data trajectory. So, yes, it does perform linear interpolation, but only if required to evaluate the model values at the desired timestamps. (Not sure if related, but procedure "set_modelx()" in 'e_norm.hoc' appears to perform linear interpolation).

As a further check to the above, if we use adaptive integration to fit the data point:

5.873875 43.9653

The closest points that the model returns are:

5.84661 38.2502
5.90114 36.4325

(I confess that the data point was chosen in retrospect)

The MRF generator returns error value = 43.877
The model does not have a value for t = 5.873875 ms and thus it linearly interpolates between the closest points t1 = 5.84661 ms and t2 = 5.90114 ms. Here t = (t1+t2)/2, so similarly v = (v1+v2)/2 = (38.2502+36.4325)/2 = 37.34135
Error Sum of Squares (37.34135 vs 43.9653) = 43.8767 ... which matches the error value returned by the MRF generator!

One last question... would you have any tips on digitizing data from figures with a view to using them for fitting? Any dos and don'ts to keep track of? I suppose the one thing that always applies is your advice to "use one's judgment of what constitutes a reasonable fit of model to data".

Post by **ted** » Fri Jun 20, 2014 4:21 pm

shailesh wrote: would you have any tips on digitizing data from figures with a view to using them for fitting?

Yes. Get the original data, if at all possible. The authors are obligated to preserve it and make it available--or should be, according to scientific principles and the stated policies of scientific journals and funding agencies.

Post by **shailesh** » Sat Jun 21, 2014 7:18 am

Thanks. I agree, it would be best to get the raw data (whenever available) for such purposes.
