Leading Indicators in Unit Testing Implementation, Part III

This series deals with the implementation of a unit testing process in a team or across multiple teams in an organization. Posts in the series include:
Goals, Outcomes, Leading Indicators I, Leading Indicators II, Leading Indicators III

In the last post we talked about the failing-builds trend as an indicator of how the implementation is going.

The final metric we’ll look at, one that can indicate how our process will go, is also related to broken builds. It is the “time until broken builds are fixed”, or TUBBF. Ok, I made up this acronym. But you can use it anyway.

If we want to make sure the process is implemented effectively, knowing that builds are broken is not enough. When builds break, they need to be fixed.

No surprise there.

Remember that the long-term goal is to have working code, and for that we need people to be attentive and responsive, fixing broken builds quickly. Tracking the TUBBF can help us achieve that goal. We can infer how well people understand the importance of working code by looking at how they treat broken builds.
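To make that concrete, here is a minimal sketch of computing TUBBF from a chronological list of build results. BuildResult is a hypothetical record; most CI servers expose equivalent data (finish time plus pass/fail) through their APIs or reports, so treat this as an illustration rather than a ready-made tool.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class TubbfCalculator {

        // Hypothetical shape of a build result; map your CI server's data onto it
        record BuildResult(Instant finishedAt, boolean passed) {}

        // Walk the builds in chronological order and measure how long each
        // red streak lasted until the next green build
        static List<Duration> timesUntilFixed(List<BuildResult> buildsInOrder) {
            List<Duration> fixTimes = new ArrayList<>();
            Instant brokenSince = null;
            for (BuildResult build : buildsInOrder) {
                if (!build.passed() && brokenSince == null) {
                    brokenSince = build.finishedAt();       // the build just broke
                } else if (build.passed() && brokenSince != null) {
                    fixTimes.add(Duration.between(brokenSince, build.finishedAt()));
                    brokenSince = null;                     // back to green
                }
            }
            return fixTimes;
        }
    }

Plot these durations over time; if the trend goes down, broken builds are being treated with the urgency we’re after.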

Sharing is caring

One of eXtreme Programming’s principles is shared code ownership, where no single person is the caretaker of any single piece of code. When our process succeeds, we want to see the corollary – everyone is responsible for every piece of code.

With small teams this is easier to achieve. Alas, at scale it becomes harder. Teams usually specialize in their own bits of code, and that conjures the old demon of ownership. With ownership come blame and the traditional passing of the buck.

After all, our CI log says it right there: They broke the build, by committing their code that doesn’t have any resemblance or relation to our code. We can’t and won’t fix it. They broke it. They should fix it.

Then comes the next logical conclusion: if we didn’t break the build, we can continue to safely commit our code. After all, we know our code works, we wrote it.

And so every team blames the other, keeps committing unchecked changes, and the build remains red.

(By the way, maybe they did commit the last bit before the build broke, but that doesn’t mean their changes were at fault. If a build takes a long time, changes are usually collected until it starts, and the build only flags the last commit, although that last one may be innocent.)

Everybody’s Mr. Fix-It

One of the drastic measures we can take is to lock the SCM system when the build breaks. That’ll teach them collective ownership.

But that doesn’t always work. People just continue to work on local copies, believing that somebody else is working relentlessly, even as we speak, on fixing the build.

Another option is to put the training wheels on: train a team to keep the build green, without interference from other teams, by developing on team-owned branches. We track the team’s behavior on their branch, encouraging them to fix the build. They are responsible for keeping the build on their own branch green. Only when branch builds are stable and green is it ok to merge them into trunk.

The worst option, and I’ve seen it many times, is having someone else be the bad cop.

Imagine an IT/DevOps/CI master who starts each day checking all the bad news from the night, tracking down the culprits, and making them – but mostly begging them – to make amends. Apart from not making the teams responsible for their code, it doesn’t stop others from committing, because the process itself is malfunctioning.

As long as we can track the TUBBF in some manner, we can redirect the programmers’ behavior toward a stable build, and teach the responsibility of keeping it green. As we do this, we focus on the importance of shared responsibility, and collect a bonus: working, sometimes even shippable, code.

Leading Indicators in Unit Testing Implementation, Part II

This series deals with the implementation of a unit testing process in a team or across multiple teams in an organization. Posts in the series include:
Goals, Outcomes, Leading Indicators I, Leading Indicators II, Leading Indicators III

Part I was HUGE! Now, let’s look at broken builds. We want to see a decrease in their number over time.

This may sound a bit strange. Our CI is supposed to tell us when the build breaks; that’s its job. Isn’t catching more broken builds a good thing?

Unsurprisingly, the answer is “it depends”.

Since we want the earliest feedback, the answer is “yes, of course”: the CI system serves us well. However, if we don’t see a decrease in broken builds over time, that may mean the CI process is not working effectively. We should investigate.

CI PI

Let’s trace the steps leading to failing builds and see if we can improve our process.

Are all the tests passing locally before the code is committed? If not, we’re integrating code that fails tests into the trunk, and when those tests run in the CI build, they will fail there too. That’s a big no-no. We may even find out that the tests are not run locally at all. These are behaviors we’d want to improve.

If tests do run and pass locally before the code is committed, yet builds still break, that may point to issues of isolation. Tests that depend on resources available in the local environment find them there and pass. On the CI server, those resources are missing, and the tests fail. More broken builds indicate the team has not yet learned how to write isolated tests.
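Here is what the difference typically looks like, sketched with JUnit and hypothetical CustomerParser and Customer classes:

    import static org.junit.Assert.assertEquals;

    import java.io.StringReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import org.junit.Test;

    public class CustomerParserTest {

        // Not isolated: passes on the developer's machine because the file
        // happens to exist there, and fails on the CI server where it doesn't
        @Test
        public void parsesCustomersFromLocalFile() throws Exception {
            List<Customer> customers = new CustomerParser()
                    .parse(Files.newBufferedReader(Paths.get("C:/data/customers.csv")));
            assertEquals(3, customers.size());
        }

        // Isolated: the input is part of the test, so it behaves the same
        // everywhere - locally and on the CI server
        @Test
        public void parsesCustomersFromInlineInput() {
            String input = "id,name\n1,Ada\n2,Linus\n3,Grace";
            List<Customer> customers = new CustomerParser().parse(new StringReader(input));
            assertEquals(3, customers.size());
        }
    }

The second test gives the same answer on every machine, which is exactly what makes a failure on the CI server worth trusting.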

There might even be a bigger issue lurking.

Trust and accelerate feedback

We want to trust the test results from the CI environment, but since the tests “work on our machine” and not on the CI server, they just got a trust downgrade. This can have a weird counter-effect on the way we run them.

Since the results we trust come from the CI, and local runs produce confusing results, we may stop running tests locally altogether, and instead run them on the CI server, making sure they run correctly there. When we do that, we make our feedback cycle longer. More importantly, we risk the tests failing for the right reason, but holding the rest of the team hostage until they are fixed.

To get the right feedback early, we need to get back to running tests locally.

We want to increase the number of isolated tests, so they can be run locally and trusted when they fail on the CI server. Isolated unit or integration tests failing before the commit are the first line of defense.

Then we want to be able to run the non-isolated tests, either locally or in as clean an environment as we can manage. The point is not to commit code until we trust it. This may require changing the available environments, modifying the tests to ensure cleanliness, pre-commit integration, or any combination of those.

Can you believe all these improvement opportunities come from a single indicator? The deeper we dig and the more questions we ask, the more opportunities we find for improving the process as a whole.

We’re not done yet.

Implementing Unit Testing – Leading Indicators (part 1)

This series deals with the implementation of a unit testing process in a team or across multiple teams in an organization. Posts in the series include:
Goals, Outcomes, Leading Indicators I, Leading Indicators II, Leading Indicators III

Now that we’ve talked about what we want to achieve, we better spread out some sensors. After all, we’re about to embark on a long and winding road. We want to know if we’re still going the right way, and if not, make a turn.

Leading indicators raise the red flag before the worst has happened. In our case, we don’t want to check in after six months and discover that no one’s written any tests for the last three months. We’d like to know sooner.

So what should we track?

If you’re thinking “coverage”, hold your horses. Coverage by itself is not a good progression metric. We sometimes confuse ease of measurement with effectiveness, and coverage is one of those cases.

Coverage as a number doesn’t have context – are we covering the right/risky/complex things? Are we exercising code but not asserting (my favorite way to game the metric)? Are we testing auto-generated code? Without the context, the coverage number has no applicable meaning.

Before we mandate tracking coverage for the entire codebase, or set a coverage goal, remember that while it is easy to count exercised code lines, the number doesn’t mean anything by itself.

So what do we look for?

The first metric to start tracking is the simplest one: the number of tests. (Actually two: tests written and tests run.)

Our first indicator of whether people are writing tests is to count the tests. For that, we need to require a convention for test location, naming, etc. Guess what? These are also needed to run the tests as part of a CI build. Once everything is set up, we can count the tests where they count (pun!).
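For example, if the CI build produces Surefire-style XML reports (as a typical Maven/JUnit build does), a small sketch like this could sum up the test count for each build. The report location and format here are assumptions; adapt them to whatever your build actually produces.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class TestCounter {

        public static void main(String[] args) throws Exception {
            // Default to Surefire's report directory; pass another path if needed
            Path reportsDir = Paths.get(args.length > 0 ? args[0] : "target/surefire-reports");
            try (Stream<Path> files = Files.list(reportsDir)) {
                int total = files
                        .filter(p -> p.getFileName().toString().matches("TEST-.*\\.xml"))
                        .mapToInt(TestCounter::testsIn)
                        .sum();
                System.out.println("Total tests: " + total);
            }
        }

        // Each report's root element carries a "tests" attribute with the count
        private static int testsIn(Path report) {
            try {
                return Integer.parseInt(DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(report.toFile())
                        .getDocumentElement()
                        .getAttribute("tests"));
            } catch (Exception e) {
                return 0;   // an unreadable report shouldn't fail the count
            }
        }
    }

Record that number per build, and the trend practically draws itself.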

Then we want to look at the trend over time. The number should go up, as people add more tests. If the trend flatlines, we need to investigate.

One thing about this metric: if you never see a drop in the number of tests, something’s wrong. That probably means the tests just stay there, and are not getting re-reviewed, replaced, or deleted. In the short term, starting out, we want to see an upward trend. But over the long haul, code changes, and so do the tests. We want to see at least some fluctuation.

So what about that coverage?

Ok, we can measure coverage. We get that out of the box, right?

But we need to know what we’re measuring. Coverage means executed lines of code. So we can look at coverage (or the lack of it) as an indicator in a context we care about.

That could be any of the following:

  • Important flows in the system
  • Buggy code
  • Code we return to over and over
  • New component we want to cover

Or any other interesting bit. When we measure coverage, we want to see a trend of increasing coverage over these areas.

Now, how do you manage that? That requires more tricks, since we want to make sure we measure the right code, and the right tests. If the code architecture already supports it, it’s easy: 75% of a library for example.

If, however, you want to measure coverage of a set of classes, excluding the other parts of the library, that requires more handling and management. Usually people don’t go there.

The funny thing is, the more specific you want to get, the less the regular tools help. And the broader numbers lose meaning.

By the way, the coverage metric should go away once you get to sufficient coverage. Once again, it’s not about the number, but about what we want to achieve – stability over time, or regression coverage. We can stop measuring then (and maybe look at other areas).

Ok, we’ll continue the indicators discussion in the next post.


Over Exposure and Privacy Issues


This conversation comes up in every training I do about testing. I know exactly when it will happen. I know what I’m going to say, what they will say, and I know no one comes out completely convinced.

Then violence ensues.

We want to test some code. However, this code started being written in the age of the dinosaurs, its size resembles some of the big ones, and the risk of changing it is the risk of feeding a Jurassic beast.

There’s no point ignoring it: we’ll need to modify the code in order to be able to test it. In fact, what’s really bothering us is that some of the information in there needs to be accessible. I suggest changing the visibility of some members.

This escalates quickly

Curious person: “What do you mean, make it public?”
Me: You know, make it public. Or add a method that will enable us to set up the class easily, so we can test it.
Curious person: “But that means breaking encapsulation”.
Me: Might be. Is that wrong?
Mildly suspicious person: “Well, the design is there for a reason. The member is private because it needs to be private, we can’t expose it”.
Me: Remember that tests are first-class users of the code? That means that the design can change to support them too.
Mildly angry person: “But you can’t expose everything, what if somebody calls it?”
Me: That’s the point, that the test can call it.
Angry person: “But you can’t ensure that only tests will call it. At some point, someone will call it and create hell on earth. We’ve all been there”.
Mildly angry Me: Yes, it might happen. We can’t predict and plan for everything; software is a risky business.
Very angry person: “THESE RULES WERE WRITTEN IN BLOOD!”

The design perspective

Let’s ignore the fact that you’re defending a horrible untested design, with the argument that changing the accessibility of one member will make it so much worse.

Encapsulation is a great concept. Having public and private members makes perfect sense. Some things we want to expose, others we want to hide.

But the water becomes murky very quickly. If we were talking black and white, only private and public, that would be ok. But for those special times, when private is just not enough…

There’s “friend” (C++).
Or “internal” (.net).
Or implicit package visibility (Java).
Not to mention “protected”.
Or reflection that bypasses everything.

It’s like we’re breaking encapsulation, but not really breaking it, so we feel better.

Let’s face it, design can serve more than one purpose, and more than one client in different ways. The language giving us tools doesn’t solve all the problems. That’s the cool thing about software, it can be molded to fit what we need it to do.

That involves living with not-so-perfect designs. It also means “changing the code just for the tests” is ok, because that’s another good purpose.
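In Java, for instance, the compromise often looks something like this. It’s a minimal sketch; PriceCalculator, its discount map and the package-private seam are all hypothetical:

    import static org.junit.Assert.assertEquals;

    import java.util.HashMap;
    import java.util.Map;
    import org.junit.Test;

    // Production class: the discount table used to be reachable only through
    // private members. Relaxing one method to package-private (no modifier)
    // lets a test in the same package set the class up, without making it
    // public to the whole world.
    class PriceCalculator {
        private final Map<String, Double> discounts = new HashMap<>();

        // package-private seam, added for the tests
        void setDiscount(String product, double discount) {
            discounts.put(product, discount);
        }

        public double priceFor(String product, double basePrice) {
            return basePrice * (1 - discounts.getOrDefault(product, 0.0));
        }
    }

    // The test lives in the same package, so it can use the seam
    public class PriceCalculatorTest {
        @Test
        public void appliesConfiguredDiscount() {
            PriceCalculator calculator = new PriceCalculator();
            calculator.setDiscount("book", 0.1);

            assertEquals(90.0, calculator.priceFor("book", 100.0), 0.001);
        }
    }

Production code in other packages still can’t touch setDiscount; the design bent just enough to serve its second client, the test.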

BUT SOMEBODY WILL CALL IT AND PEOPLE WILL DIE

Yes, somebody might call it.

That somebody might also look at the code and decide to change it, or add an accessor themselves.
It’s silly, not to mention risky, to think that a language keyword is the only thing standing between us and a disaster.

In C, everything is public; how do you prevent the “someone might call it” disaster waiting to happen there? Playing with .h files and custom linking can be a lot riskier.

We do it in the way that’s always more effective than the code itself: processes that involve humans, rather than relying on tools.

We need to understand the purpose of code – to create value for someone. If the code is beautiful but cannot be tested, you don’t get points for using the “right” encapsulation.

The value comes when functionality works. In order to know it works, we need to check it. If we’ve discussed the risks involved and decided the value comes from testing, that usually outweighs the risk of “someone might do something bad with the code” (after many have abused it already).

And if you feel this risk is not negligible – do something about it. Do code reviews, document and share the knowledge, create architecture guidelines.

But don’t rely on a language feature as the first resort.

Unit Testing Anti-Pattern: Prefixing Test names With “test”

This series goes through anti-patterns when writing tests. Yes, there are and will be many. 
TDD without refactoring, Logic in tests, Misleading tests, Not asserting, Code matching, Data transformation, Asserting on not null, Prefixing test names with "test"

History teaches us that old habits are hard to break. Much like the proverbial monkeys who haven’t been there for the first electric shock, but still wouldn’t get near the banana, the habits are even harder to break because “that’s what we do here, we can’t explain why”.

People are still writing tests that contain “test” in the name, and you know why? Because that’s how they do it here. And it makes sense, right? It’s a test!

Which begs the question: we know it’s a test, but how does a test framework know it’s a test? What makes a test different than any other method?

The olden days

In Java, when JUnit started to take shape (and later in .net and other reflection-supporting languages), before annotations were supported, the test framework needed to identify a test in some way. Up to JUnit 3, that way was to start the test method name with “test”. Convention over configuration. Beautiful.

Since JUnit 4, which took advantage of Java annotations (@Test), the requirement for a test name to start with “test” was no longer there. In NUnit and other .net frameworks, attributes played that part.
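To make the difference concrete, here is a minimal sketch; Calculator is a hypothetical class, and the two snippets represent two separate test classes:

    // JUnit 3 style: the framework finds tests by reflection, looking for
    // public methods whose names start with "test"
    import junit.framework.TestCase;

    public class CalculatorJUnit3Test extends TestCase {
        public void testAdd() {
            assertEquals(5, new Calculator().add(2, 3));
        }
    }

    // JUnit 4 and later: the @Test annotation marks the test method,
    // so the name is free to describe the behavior instead
    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class CalculatorJUnit4Test {
        @Test
        public void addingTwoPositiveNumbers_returnsTheirSum() {
            assertEquals(5, new Calculator().add(2, 3));
        }
    }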

In other languages that didn’t identify tests using reflection, things got more interesting.

Identifying the tests using reflection saved the need for test registration. If you look at many older C/C++ frameworks, you’ll see the registration part. In CppUnit, for example, we need to manually register each test in the test suite.

There was no way the framework could differentiate between a test and any other function automatically. In more modern frameworks, the registration is done for you by macros, so there’s no need to go find tests and tag them properly.

By the way, another advantage of prefixing tests with “test” was that it helped with collecting and identifying the tests for registration. That’s not needed anymore either, thanks to the automatic registration done by macros.

To summarize, with modern frameworks there is no need for the identification of tests by prefixing them.

Furthermore, if locating tests is your issue, modern project conventions help with that more than a name can. Maven projects, for example, tell you where to place and find the tests. The project’s file structure has much more to do with identifying and running tests than anything else.

And so ends our history lesson, and you probably want to ask, again: Then why don’t people stop doing that?

Apart from the “that’s what we do here” part, I mean. They don’t have a good explanation.

Still, there’s a good reason why you should drop the “test” prefix.

Visual noise

Code is complex enough already. Adding more code simply distracts us from understanding it. It’s true with big classes and methods, and it is also true with tests.

I always encourage writing descriptive test names (and for regular methods too). That means tests get long names, which is ok if they carry the information that describes what the test checks, and in what context.

Obviously, having the word “test” doesn’t help me understand it better. It’s just noise, a distraction from the important bits.

This is just low noise. Annoying, but bearable.

But when you look at a CI report and everything starts with “test”, it’s no longer low volume. It takes up a big portion of the screen and of the information. Again, everything in the test part of the CI report is obviously a test. The noise is amplified, and it distracts us further from identifying sibling tests and their results, looking for specific tests, and so on.

In a nutshell: prefixing tests with “test” doesn’t help anyone (unless maybe you’re dealing with legacy test frameworks), and can even cause delays when you’re trying to solve problems with failed tests.

Just don’t do it.

Unit testing Anti-pattern – Not Asserting

This series goes through anti-patterns when writing tests. Yes, there are and will be many. 
TDD without refactoring, Logic in tests, Misleading tests, Not asserting, Code matching, Data transformation, Asserting on not null, Prefixing test names with "test"

You’d think that we don’t need this kind of post in our day and age, but I still see tests written without the assert part.

This pattern, as you’d expect, appears with beginners. However, when I ask, they usually know that they should assert.

So why do they still write tests like that?

A test without an assert (containing only an action, and possibly some setup) only makes sure that the code runs without throwing an exception. In any other case, the test passes.
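It usually looks something like this – a sketch with hypothetical OrderProcessor and Order classes, assuming JUnit:

    import org.junit.Test;

    public class OrderProcessorTest {

        // Setup and an action, but no assert: the test can only fail
        // if an exception is thrown somewhere along the way
        @Test
        public void processOrder() {
            OrderProcessor processor = new OrderProcessor();
            Order order = new Order("book", 2);

            processor.process(order);   // no assert - we only know it didn't blow up
        }
    }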

Crashing isn’t everything

I think that we can agree that knowing the code doesn’t crash is valuable.

In fact, if we expand the unit test into a wider scope (like an integration or an end-to-end test), it tells us that not only does the functionality not blow up, but the wiring, configuration and the whole path don’t blow up either. This is useful to know.

And usually this is where tests without asserts start.

We’re talking about beginners. They use their new found execution engine (read: unit test framework) for running a process, and see that it doesn’t break. The new tool helps with identifying that.

There is a deeper issue, though.

Beginners who start unit testing usually don’t immediately rely on the tests as their only proof of working code. So the tests are written after the code has already been proven to work by running the application, or with some other tool.

The process is: write the code, check that it works, and then write the test. When we write the test, we already know the code works. Now we have a test for it, one that doesn’t crash.

“Ok, boss, here’s the test you wanted me to write, and it proves that my code works, happy?”

Not only is the test writing extra work, now we need the assertion on top of it. Even more work.

You know what comes after the boss looks at the code, and mentions that the test doesn’t have an assert? Asserting on not-null. It’s the easiest to add.

And this is how we get working tests without assertions that are considered “ok”.

The right way

Much like with asserting on a non-null return value, we can do better. While there is value in that kind of test, we get more value from a specific assert that tells us where the problem is.

The value of the assertion is that it specifies the behavior we expect. Experienced developers start with how they will check the behavior, before setting up the test.

Even if it’s just one side effect of the operation, the assertion tells us what to look for, and what code caused that effect. Using triangulation we can pinpoint the problem and fix it quickly.
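Continuing the earlier hypothetical OrderProcessor sketch, a specific assertion might look like this (OrderStatus is also made up for the example):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class OrderProcessorTest {

        @Test
        public void processingAnOrderMarksItAsShipped() {
            OrderProcessor processor = new OrderProcessor();
            Order order = new Order("book", 2);

            processor.process(order);

            // The assert names the one effect we expect, so a failure
            // points straight at the behavior that didn't happen
            assertEquals(OrderStatus.SHIPPED, order.getStatus());
        }
    }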

If we write the assertion, we can learn that the code needs to expose information for the tests, and possibly for other clients. It can help make the code more usable.

If tests don’t have an assert, or if the assertion is on non-null, ask “how do you know it works?” in the context of the test. Once this question is answered, you can craft the assert accordingly.


Unit Testing Anti-pattern – Asserting on Not Null

This series goes through anti-patterns when writing tests. Yes, there are and will be many. 
TDD without refactoring, Logic in tests, Misleading tests, Not asserting, Code matching, Data transformation, Asserting on not null, Prefixing test names with "test"

Catchy little title, right?

Once people understand they actually need to write an assert, this pattern appears almost immediately. (And for those who think: Who doesn’t write an assert, wait for the next post.)

Consider a test along these lines (Converter and Report here are hypothetical stand-ins for the code under test):
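    import static org.junit.Assert.assertNotNull;
    import org.junit.Test;

    public class ConverterTest {

        @Test
        public void convert() {
            Converter converter = new Converter();

            // Converter and Report are hypothetical; the point is the assert
            Report result = converter.convert("input.txt");

            assertNotNull(result);
        }
    }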

The conversion works, obviously. The test proves it.

First, let’s discuss the acceptance criteria. Is a null result indeed a sign of failure?

Because if it’s not, and the code always returns a non-null value (for example, when using the Null Object pattern), then we have ourselves a test that always passes, and that’s not of much value.

But for the sake of argument, let’s assume that null means failure.

So it’s ok to check for its non-null-ness, right?

Yes.

And, also not really interesting.

Because the client code is probably more interested in the returned value itself than in its non-null-ness. It will probably do something with it, like this (continuing the hypothetical Converter sketch):
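    // Typical client code doesn't stop at "not null" - it uses the value
    Report report = converter.convert("input.txt");
    String summary = report.getSummary();   // hypothetical usage of the result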

Because that’s what we do with results of functions, we use them for something.
Now, if the client code is going to use that value, why not the test? The test is a primary user of the code, just like production code. Then why not write the test like this (again with the hypothetical Converter, and a made-up expected summary):
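    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class ConverterTest {

        @Test
        public void convertProducesAReadableSummary() {
            Converter converter = new Converter();

            Report report = converter.convert("input.txt");

            // Assert on what the client actually cares about, not just non-null
            assertEquals("3 records converted", report.getSummary());
        }
    }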

Well, sure, that’s better.

But what if it actually returns null?

If the result is null, we’ll get a nice NullPointerException (Java here, but it depends on your language of choice), and that exception will point to the exact line the assertNotNull would have pointed to. It’s like magic.

So we didn’t lose information; instead, we gained confidence that the way production code is actually going to use our object works.

My guess – we’re better off.