Performance and thread-safe state handling
In my previous post I introduced the state pattern with an AtomicReference as a way to maintain thread safety. My reason for playing with the pattern was really to make the code easier to analyze and prove correct, but the question of performance inevitably comes up, and it is interesting to know the trade-offs.
I set up a benchmark where ten threads would simultaneously try to execute the same code and each thread would just hammer away an obscenely large number of times. First up for the test was the simple lamp:
public interface Lamp {
    boolean isOn();
    void setOn(boolean on);
}
The code to execute was:
boolean b = lamp.isOn();
lamp.setOn(!b);
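The benchmark harness itself isn't shown in the post. A minimal sketch of one that matches the rules above (a fixed number of threads released simultaneously against a shared object, timing the whole run) might look like this; the class and method names are mine:

```java
import java.util.concurrent.CountDownLatch;

public class Harness {
    // Returns the average time per task execution in nanoseconds, measured
    // while `threads` threads hammer the same task simultaneously.
    static long run(int threads, int iterations, Runnable task) throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                try {
                    start.await(); // line up so all threads start together
                    for (int i = 0; i < iterations; i++) {
                        task.run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        long t0 = System.nanoTime();
        start.countDown(); // fire the starting gun
        done.await();
        return (System.nanoTime() - t0) / ((long) threads * iterations);
    }
}
```

For the Lamp test the task would be `() -> lamp.setOn(!lamp.isOn())`; a real harness would also need a warm-up phase, which turns out to matter a great deal below.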
I had an implementation with just a volatile field, an implementation with both methods synchronized, a hybrid implementation where the field was volatile and only the setter was synchronized, and an implementation using the state pattern. I also ran the bad non-thread-safe implementation as a baseline. The results, with the average time for one pass through the code: volatile (669 ns), bad (697 ns), state (985 ns), hybrid (1570 ns), synchronized (2535 ns). Interestingly enough, the state pattern kicks butt. It was actually even faster than volatile in the first tests I ran, but that was probably because I ran too few iterations to pay the full cost of garbage collection.
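For reference, here is a sketch of what the state-pattern Lamp plausibly looks like, with the whole state held in one immutable object behind an AtomicReference; the class and field names are my assumptions, not necessarily the code from the previous post:

```java
import java.util.concurrent.atomic.AtomicReference;

interface Lamp { boolean isOn(); void setOn(boolean on); } // as above

public class StateLamp implements Lamp {
    // Immutable snapshot of the lamp's entire state.
    private static final class State {
        final boolean on;
        State(boolean on) { this.on = on; }
    }

    private final AtomicReference<State> state =
            new AtomicReference<>(new State(false));

    @Override
    public boolean isOn() {
        return state.get().on; // a single volatile-style read, no lock
    }

    @Override
    public void setOn(boolean on) {
        state.set(new State(on)); // atomically install a fresh immutable state
    }
}
```

The allocation of a new State on every write is where the garbage-collection cost mentioned above comes from.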
So how about a predicted worst case, the CoffeeMachine, where the state changes on every method call?
public interface CoffeeMachine {
    void fillBeans(int beans);
    void fillWater(int water);
    Coffee makeCoffee(int water, int beans) throws OutOfWaterException, OutOfBeansException;
}
And the test code, same rules as above, ten threads whacking away:
Random r = new Random();
....
int w = r.nextInt(5);
cm.fillWater(w);
int b = r.nextInt(3);
cm.fillBeans(b);
try {
    cm.makeCoffee(w, b);
} catch (OutOfWaterException e) {
    oow.getAndIncrement();
} catch (OutOfBeansException e) {
    oob.getAndIncrement();
}
This was a straight shoot-out between a synchronized implementation and a state pattern implementation. Result: synchronized (5021 ns), state (5333 ns). A slight victory for synchronization, yeah!
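The state implementation under test isn't reproduced here. A sketch of the shape it plausibly has, with every mutation spinning on compareAndSet until it swaps in a fresh immutable State (the State layout and helper classes are my assumptions):

```java
import java.util.concurrent.atomic.AtomicReference;

class OutOfWaterException extends Exception {}
class OutOfBeansException extends Exception {}
class Coffee {}

public class StateCoffeeMachine {
    // Immutable snapshot of both levels; replaced wholesale on every change.
    private static final class State {
        final int water, beans;
        State(int water, int beans) { this.water = water; this.beans = beans; }
    }

    private final AtomicReference<State> state =
            new AtomicReference<>(new State(0, 0));

    public void fillWater(int water) {
        State s;
        do {
            s = state.get();
        } while (!state.compareAndSet(s, new State(s.water + water, s.beans)));
    }

    public void fillBeans(int beans) {
        State s;
        do {
            s = state.get();
        } while (!state.compareAndSet(s, new State(s.water, s.beans + beans)));
    }

    public Coffee makeCoffee(int water, int beans)
            throws OutOfWaterException, OutOfBeansException {
        State s;
        do {
            s = state.get();
            if (s.water < water) throw new OutOfWaterException();
            if (s.beans < beans) throw new OutOfBeansException();
            // Retry the whole read-check-replace if another thread won the race.
        } while (!state.compareAndSet(s, new State(s.water - water, s.beans - beans)));
        return new Coffee();
    }
}
```

Because every call writes a new State, every call pays for an allocation and a CAS, which is why this was predicted to be the worst case for the pattern.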
So how about adding some getters to the CoffeeMachine? How does that change things? The test code was changed to:
Random r = new Random();
....
int w = r.nextInt(5);
if (cm.checkWater() < w * 10) cm.fillWater(w * 100);
int b = r.nextInt(3);
if (cm.checkBeans() < b * 10) cm.fillBeans(b * 100);
try {
    cm.makeCoffee(w, b);
} catch (OutOfWaterException e) {
    oow.getAndIncrement();
} catch (OutOfBeansException e) {
    oob.getAndIncrement();
}
With the getters present, it makes sense to try a hybrid approach again. Result: state (2547 ns), hybrid (3675 ns), synchronized (4930 ns).
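Part of why the state pattern pulls ahead once getters appear: under this scheme a getter is a single read of the AtomicReference, with no lock acquisition and no CAS retry. A hypothetical sketch (the class name and initial values are just for illustration):

```java
import java.util.concurrent.atomic.AtomicReference;

public class Gauges {
    private static final class State {
        final int water, beans;
        State(int water, int beans) { this.water = water; this.beans = beans; }
    }

    private final AtomicReference<State> state =
            new AtomicReference<>(new State(40, 12));

    public int checkWater() { return state.get().water; } // no lock, no CAS
    public int checkBeans() { return state.get().beans; }

    // Reading the reference once also yields a mutually consistent pair,
    // something a synchronized getter can't give you without holding the
    // lock across both reads.
    public int[] levels() {
        State s = state.get();
        return new int[] { s.water, s.beans };
    }
}
```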
Conclusion: not only does the state pattern with an AtomicReference to an immutable internal state representation make it easier to manage code complexity, it is also beneficial from a performance point of view, being just slightly slower than synchronization in the worst case and twice as fast in normal cases.
Addendum: My keen colleague, the informed sceptic, pointed out to me that perhaps some more testing is needed before it can be generally said that the state pattern performs better. He claimed that in normal usage threads do something else and only occasionally hit the shared resource, so the locks are mostly uncontended. I suppose he has a point. But how to test it?
First attempt, just run my Lamp benchmark single-threaded: volatile (10 ns), synchronized (10 ns), bad (19 ns), hybrid (22 ns), state (42 ns). Oops! The state pattern is four times slower than synchronized. It is also interesting that plain old non-thread-safe code is slower than volatile and synchronized, and that the hybrid approach is twice as slow as synchronized. But what is going on here? We ran a long warm-up and then just whacked away at our Lamp from a single thread. Obviously HotSpot is able to do some nice things with the synchronized methods in this case (good to know).
Next attempt, run a minimal warm-up and a short time: volatile (10 ns), bad (18 ns), state (44 ns), hybrid (63 ns), synchronized (87 ns). Now we're not hitting it hard enough for HotSpot to do all its magic, and synchronized is dog slow again.
Third attempt, hit the class hard from just two threads: bad (120 ns), volatile (121 ns), state (187 ns), hybrid (338 ns), synchronized (752 ns). But we're still hitting it hard. We need to figure out how to simulate the non-contention scenario with many threads.
Fourth attempt, put a sleep in each loop. No good, there's too much noise coming in from the calls to sleep(). For what it's worth, the numbers are very similar, with differences of less than 1% (duh!). We need another angle of attack.
Just for a change of pace I decided to run the CoffeeMachine benchmark in one thread: state (114 ns), hybrid (160 ns) and synchronized (212 ns). Hmm, I realize I didn't really run single-threaded, because two threads had access to the CoffeeMachine object. One thread was just waiting for the other to finish, though, so all locks were uncontended. Trying that with Lamp I get: volatile (13 ns), bad (20 ns), state (42 ns), hybrid (65 ns) and synchronized (89 ns). It seems that uncontended locks are just as expensive as contended locks, and that HotSpot was able to do some powerful magic when the object was only accessible from one thread. Sanity check on CoffeeMachine running truly single-threaded: state (110 ns), hybrid (113 ns), synchronized (208 ns). Surprise! HotSpot couldn't do as much this time. I just had to check the performance against a standard non-thread-safe implementation: standard (123 ns) versus state (110 ns)!
Conclusion 2: The thread-safe state pattern is excellent to use as a standard pattern. As my eager colleague so often points out, the important thing is to be able to analyze the correctness of the code; performance is usually not that important. Well, the state pattern originally evolved to make it easier to handle varying behaviour depending on state, and transitions between these behavioural states, so that requirement is solidly fulfilled. For anything but the simplest of classes, the thread-safe state pattern also seems to be the most performant way of achieving thread safety, and it actually performed better than an ordinary non-thread-safe implementation in at least the one case tested here.