This work was discussed by the author in the Haskell Interlude podcast as well [0]. Highly recommended and probably easier to digest than a whole dissertation.
[0]: https://haskell.foundation/podcast/59/
The python package Hypothesis[0] already does a great job bringing property-based testing to the people! I've used it and it's extremely powerful.
[0]: https://github.com/HypothesisWorks/hypothesis
I have used Python's `hypothesis` as well, and I wish it were better. We had to rip it out at work as we were running into too many issues.
I have also used Haskell's `QuickCheck` and Clojure's `spec` / `test.check` and have had a great experience with these. In my experience they "just work".
Conversely, if you're trying to generate non-trivial datasets, you will likely run into situations where your specification is correct but Hypothesis' implementation fails to generate data, or takes an unreasonable amount of time to generate data.
Example: Generate a 100x25 array of numeric values, where the only condition is that they must not all be zero simultaneously. [1]
[1] https://github.com/HypothesisWorks/hypothesis/issues/3493
I understand your pain in some sense, but on another level I feel like people with a decent amount of Hypothesis experience "know" how the generator works and understand that you basically _never_ want to use `filter` if you can avoid it, instead relying on generation that can never be rejected in the first place.
A silly idea for your generator would be to generate an array and, if it's all zeros, draw a random index and a random non-zero number and write it into the array at that index. It leads to some weird non-convexity properties but is a workable hack.
In your own example you turned off the "data too slow" health check, probably because building up a dataframe (all just to do a column sum!) is actually kind of costly at large sizes! Your complaint is probably aimed at the pandas extras (or pandas itself) rather than the concept of Hypothesis.
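FWIW, here's a rough sketch of the patch-the-array idea above using plain Hypothesis strategies (untested; the 100x25 shape is just the size from the linked issue, and a grid that big may still need the size/slowness health checks relaxed):

```python
from hypothesis import given, strategies as st

ROWS, COLS = 100, 25
finite = st.floats(allow_nan=False, allow_infinity=False)

@st.composite
def not_all_zero_grid(draw):
    values = draw(st.lists(finite, min_size=ROWS * COLS, max_size=ROWS * COLS))
    if all(v == 0 for v in values):
        # Patch in one guaranteed non-zero entry at a drawn index.
        # Filtering a single scalar is cheap, unlike filtering the whole grid.
        idx = draw(st.integers(min_value=0, max_value=ROWS * COLS - 1))
        values[idx] = draw(finite.filter(lambda v: v != 0))
    # Reshape the flat list into rows.
    return [values[r * COLS:(r + 1) * COLS] for r in range(ROWS)]

@given(not_all_zero_grid())
def test_not_all_zero(grid):
    assert any(v != 0 for row in grid for v in row)
```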
No, I ran into the same issues with basic data structures. The dataframe wasn’t necessary, it just matched the expected input of some function I wanted to test.
I took your case, and I got way better perf just generating a list of numbers and then reshaping it into a dataframe.
But! Even though it doesn't get that much slower, at a certain number of rows it just starts hanging! Like at 49 rows everything is still fine and at 50 it no longer wants to work. It's very bizarre and I'll see if I can debug it. But I think your test case points to some sort of bug rather than a fundamental issue with Hypothesis.
> Even though it doesn't even get that much slower at a certain number of rows it just starts hanging
Yes, this brings back memories. I've definitely seen this kind of behaviour as well, in many different, not-particularly-exotic, situations.
I am absolutely convinced the issue I raised on the github project was a bug or a defect, despite the maintainers not taking it seriously.
I find QuickCheck and Clojure spec/test.check much more straightforward to use. I just never ran into this sort of thing with these other tools.
That kind of behavior can happen at the threshold of Hypothesis' internal limit on entropy - though if you're not hitting HealthCheck.data_too_large then this seems unlikely.
Let me know if you have a reproducer, I'd be curious to take a look.
Not weighing in on any particular tech (hypothesis or otherwise), but intrigued by your example...
My initial impulse is to pick a random cell which must not be zero, generate a random number for each other cell, and a random non-zero number for that one. I'm not immediately decided on whether it's uniformly distributed.
I would pick the number of non-zeros first, assert that it's non-zero, then continue filling in the values themselves. And probably not with a uniform distribution.
Any algorithm that cares about the number of non-zeros could have non-trivial interactions with their arrangement and count, so picking something that generates non-trivial sparsity (and doesn't just make the array look like white noise) is going to have the best chance of exposing interesting behavior. The tricky part is thinking through how to generate "interesting" patterns, which admittedly I haven't put enough thought into.
Ah, yeah, generating for property testing probably doesn't want a uniform distribution. What patterns are interesting will surely depend on what we're doing with the array.
Indeed, the world is not IID. As such, test cases should not be a uniformly distributed sample of some mathematical distribution.
> Indeed, the world is not IID.
Right! And even if it were, in the sense that that's what we should expect as real world input, it wouldn't generally be the best distribution for finding bugs.
As the comments on your linked issue point out:
(a) Filtering is a last resort and is best avoided. As an example, the Gen type in Haskell's falsify package can't be filtered, since it's a bad idea. As another example, ScalaCheck's Gen type can be filtered, but they also allow "retries" (by default, up to 10,000 times), because filtering is very wasteful.
(b) If you're going to filter, scope it to be as small as possible (e.g. one comment points out that you're discarding and regenerating entire dataframes, when the filter only depends on one particular column; see the sketch after this list)
(c) Have some vague awareness of how your generators will shrink, to avoid infinite loops. In your case, shrinking will make it more likely to fail your filter; and the "smallest" dataframe (all zeros) will definitely fail.
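To make (b) concrete, a hedged sketch of the difference (the sizes and column shapes here are made up):

```python
from hypothesis import strategies as st

column = st.lists(st.integers(), min_size=10, max_size=10)

# Wasteful: every rejection throws away and regenerates both columns.
whole_thing_filtered = st.tuples(column, column).filter(lambda cols: any(cols[0]))

# Better scoped: only the column the condition actually depends on is retried.
one_column_filtered = st.tuples(column.filter(lambda col: any(col)), column)
```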
It’s still very weird that the generator can’t avoid an all-zeros array. Figuring out the root cause might find something interesting.
Care to expand upon the issues you were running into with hypothesis? I'm genuinely curious as I may soon be evaluating whether to use it in a professional context.
As far as I'm aware, Hypothesis is fundamentally based around the idea of "generators are parsers of randomness" discussed in this paper; i.e. a Hypothesis "strategy" is essentially a function from bytestrings to values. To generate random values, those strategies are run on a random bytestring; to shrink a previous value, the bytestring that led to that value is shrunk.
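A toy sketch of that idea (nothing like Hypothesis' real internals, just the shape of it): a generator "parses" a stream of bytes into a value, and shrinking operates on the bytes rather than on the value.

```python
import random

def gen_int(choices, lo=0, hi=255):
    # "Parse" one byte of the choice sequence into an integer in [lo, hi].
    return lo + next(choices) % (hi - lo + 1)

def gen_list_of_ints(choices):
    # Parse a length, then that many elements.
    return [gen_int(choices) for _ in range(gen_int(choices, 0, 10))]

def run(generator, byte_string):
    return generator(iter(byte_string))

source = bytes(random.randrange(256) for _ in range(64))
print(run(gen_list_of_ints, source))     # generated from random bytes
print(run(gen_list_of_ints, bytes(64)))  # all-zero bytes "shrink" to []
```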
Haskell's "falsify" package takes a similar approach, but uses a tree of random values. This has the advantage that composite generators can run each of their parts against a different sub-tree, and hence they can be shrunk independently without interfering.
For structure generation I prefer Doug McIlroy's approach: pick a tree size (from some arbitrary distribution), and then, of the n possible valid structures of that size, produce the kth one with k chosen uniformly.
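A small sketch of what I mean, using binary trees counted by the Catalan numbers: pick a size, then unrank the kth tree of that size for a uniformly chosen k (the helper names here are mine, not from McIlroy's paper):

```python
from functools import lru_cache
import random

@lru_cache(None)
def count(n):
    # Number of binary trees with n internal nodes (Catalan number).
    if n == 0:
        return 1
    return sum(count(i) * count(n - 1 - i) for i in range(n))

def unrank(n, k):
    # Return the k-th (0-based) binary tree with n internal nodes.
    if n == 0:
        return None
    for i in range(n):                       # i nodes in the left subtree
        block = count(i) * count(n - 1 - i)
        if k < block:
            left_k, right_k = divmod(k, count(n - 1 - i))
            return (unrank(i, left_k), unrank(n - 1 - i, right_k))
        k -= block
    raise IndexError("rank out of range")

size = random.randint(0, 8)                          # size from some arbitrary distribution
tree = unrank(size, random.randrange(count(size)))   # uniform over trees of that size
```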
https://www.cs.dartmouth.edu/~doug/nfa.pdf gives an NFA variant; extending it to a PDA is an (interesting, I found) exercise.
I think property based testing becomes a lot easier when you can just use normal asserts like this: https://github.com/unexpectedjs/unchecked
Not to be too critical, but coming from Java/C# I'm not so hot on the fact that they are writing their own tests: https://github.com/unexpectedjs/unchecked/blob/master/test/i...
Maybe just a matter of familiarity?
Would be very valuable if someone could write a summary of novel ideas for practitioners (if there are any).
Not read the whole thing yet, but the idea to represent generators as free monads which can be interpreted in multiple ways is very promising. In particular, this may be a way to unify various generator implementations, like the random generators from QuickCheck, hedgehog, falsify, SmartCheck, etc. with the enumerations provided by Smallcheck, FEAT, LeanCheck, etc. and maybe even the weirder ones like LazySmallcheck, extrapolate, fitspec, speculate, lazy-search, spectacular, the various logic-programming implementations of property-checking, etc.
In principle, the same framework would extend to parsers too (atto/mega/parsec, etc.), though parsing is often a bottleneck in real applications, so many would avoid such abstraction/indirection. Still, it would be nice if libraries could define parsers for their datatypes, in such a way that it's easily reusable as a generator in tests (without having the library package depending on any test-generator packages).
They also demonstrate finding the "derivative" of a generator (i.e. the result of fixing its next choice to a particular outcome), which is useful for avoiding choices that cause preconditions to fail (something which many PBT users seem to struggle with: assuming that the framework will somehow avoid problematic values, rather than just rejection-sampling them as most do).
Just going to plug the excellent .NET PBT library, CsCheck [0]. I have used it quite a bit to exercise strange corners of my program logic and found several exotic bugs with it.
[0]: https://github.com/AnthonyLloyd/CsCheck
Property testing is a lot like formal methods - really cool, but almost entirely useless in ~95% of commercial contexts.
They're both extremely useful when, say, building a parser, but when the kind of code you write involves displaying custom widgets, taking data and pushing it onto a queue, looking up data in a database, etc. integration tests have a lot more bang for the buck.
For "business applications," I much prefer the variant of property testing called model-based testing. This is where your property is "does the implementation behave like some simplified model." This is the correct way to test such applications in my opinion, because users don't care about "custom widgets," "queues," or "databases." They care about business functionality, which is quite easily expressible in model form.
I don't do much mocking either, because I also prefer integration tests for these types of applications. Here is an example that tests against a database, all the way up to the state management layer of the UI: https://concerningquality.com/model-based-testing/.
In fairness, people don't write about doing this a lot. PBT literature seems to revolve around sorting functions or binary tree operations. But PBT is much more powerful than just that.
>in my opinion, because users don't care about "custom widgets," "queues," or "databases."
Users care quite a lot when these things break.
>They care about business functionality, which is quite easily expressible in model form.
They care about reliable applications that do the task they want. Users don't have a concept of "business functionality". If a widget they wanted to use is broken, they'll be pissed - they won't tell you that it's ok because the concept of the application was still good.
From your link:
>The main thing that I want to highlight here is how simple the model is. This is pure domain logic, and is probably as close to the essential complexity of the application that we can get
This is exactly the issue.
If I go down my recent list of bugs and categorize them, ~8 out of 10 of them are in areas which the author would have *abstracted away*. The other two probably could have been avoided by having a team member write one test to cover that functionality instead of zero.
What does it mean when you abstract something away from a test? Well, it means the test won't catch a bug in that thing.
If I went down my recent list of bugs and could see that 9 out of 10 of them would have been caught by adding property testing/model testing on top of regular testing/typing, etc., I'd have absolutely no second thoughts about implementing it, but that simply isn't the case.
I have worked on problems in (unusual) environments where it is the case, and in those rare cases I have advocated for and used property testing, but I have more often seen it advocated and used by people who see every bug as a nail and property testing as some kind of super advanced hammer.
> Users care quite a lot when these things break.
Users care when their expected behavior breaks. They certainly do not care why it broke, or _where_ it broke. Most users don't know what a database is.
> If I go down my recent list of bugs and categorize them, ~8 out of 10 of them are in areas which the author would have abstracted away.
Even though the model is abstract, all of the behavior of the implementation is covered in a model-based test. For example, let's say a SQL query is what's broken for some use case. This query would be exercised in the test, and it wouldn't pass until the query is fixed. That's the main reason I prefer model-based integration tests - there are no such bugs that escape.
Also, I still write targeted tests. There's nothing that prevents you from doing that, and it absolutely is useful in many cases. I just believe that most tests should be generated, and hand-written tests should be the minority.
Hah! Try to separate your domain logic from your interfaces (e.g., using something like hexagonal architecture) and then say this again.
Yes, it's a lot of work coming up with good properties, but it massively helps to find gaps in the domain logic. In my experience, these gaps are what's typically expensive, not the weird problem a junior had with properly using Redis or S3.
>Hah! Try to separate your domain logic from your interfaces
I'm not an amateur.
The only time I don't do this is when there literally is no domain logic yet (e.g. a CRUD app).
>In my experience, these gaps are what's typically expensive, not the weird problem a junior had with properly using Redis or S3.
What can I say? Your experience might not be as broad as mine.
Redis is a source of almost no bugs because it is very well designed, but most interfaces I couple to have design qualities that are the exact opposite of redis's.
Those interfaces (e.g. wonky payment gateway APIs, weird microservice APIs) are probably the source of most bugs in the enterprise systems I work on.
#2 is probably simple misspecifications (customer said code should do X, it should actually do Y which is almost the same but very slightly different).
#3 would be domain logic errors. And even most of those are uncovered and avoided with saner architecture or a couple of unit tests.
For the parsers I write at home, sure, property testing kicks ass. For your college degree algorithm coursework, sure, it helps a lot. For 95% of business logic? Pointless; the complexity isn't buried deep in the business logic.
It's very useful when you're working on numerical software. Often, it's hard to figure out exactly what output your code should return (because if you knew the answer you wouldn't have to write the code), but you can easily list properties you expect.
Right, metamorphic testing in particular (which would be a special case of PBT, with metamorphic relations being properties), https://en.wikipedia.org/wiki/Metamorphic_testing, https://github.com/MattPD/cpplinks/blob/master/testing.md#pr...
One simple example (from the above) is "sin (π − x) = sin x" for the implementation of the sine function not requiring the knowledge of its specific output values. Naturally, instead of the literal equality "=" one can use a more appropriate accuracy specification as in, say, relative ulp (https://en.wikipedia.org/wiki/Unit_in_the_last_place) error bound, cf. https://members.loria.fr/PZimmermann/papers/accuracy.pdf
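As a hedged sketch, that sine relation written as a Hypothesis property (the tolerances are picked loosely, since the pi - x computation is itself inexact):

```python
import math
from hypothesis import given, strategies as st

@given(st.floats(min_value=-1e6, max_value=1e6, allow_nan=False))
def test_sine_reflection(x):
    # Metamorphic relation: sin(pi - x) should agree with sin(x),
    # up to a small absolute/relative error rather than exact equality.
    assert math.isclose(math.sin(math.pi - x), math.sin(x),
                        rel_tol=1e-9, abs_tol=1e-9)
```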
I agree, but this is a good thing! My default approach these days is functional tests for everything possible, and property based tests for anything particularly algorithmic or containing lots of edge cases, and no unit tests outside that. This is a great combo, covers all the business value without leaving obscure bugs, and also isn’t a pain every time you refactor.
I've found it effective for anything that handles arbitrary input, especially from end-users. But if that data is coming from within your systems where you have full control over everything, less valuable.
It's useful for things that would be too painful to enforce via types. For example, "this list should be non-empty" is pretty easy to enforce via types (use a pair of head element and possibly-empty tail list). On the other hand, something like "map of non-overlapping date ranges to counts, with automatic splitting on insertion" would be pretty tricky. It's easier to state the desired properties as universally-quantified functions rather than types, at which point we might as well throw them into a few property checkers to see if they find anything we didn't think of.
If the arbitrary input is, say, a text box which takes a name and puts it into a database, it probably won't uncover any bugs.
It has some use if you build something like a complex pricing engine, numerical code or a parser for a mini DSL. I find that problems of this type don't crop up a lot though.
I disagree.
I've used it for things like "regardless of where you are on the page, tab n times and shift tab n times leaves you on the original item".
I found a bug in our tv ui library which was actually a bug in the spec. Regardless of how you built the ui, if you press a direction and focus moves, pressing the opposite direction takes you back - but we had another rule that broke this. We had tests for both, and it was only when I made the general test (for all ui, items in it and directions) it found the inconsistency.
It was also pretty easy to write.
I've also found issues around text processing due to lowercasing not always resulting in the same length string, and more. I found a bug demoing PBT for a contract gig I was going for that was around some versioning.
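The lowercasing one is easy to reproduce; a minimal sketch of the naive property, which is false (in Python, 'İ'.lower() already has length 2, so a checker tends to find a counterexample quickly):

```python
from hypothesis import given, strategies as st

@given(st.text())
def test_lower_preserves_length(s):
    # This plausible-looking property is false: Unicode case mapping
    # is not length-preserving, e.g. 'İ'.lower() is two code points.
    assert len(s.lower()) == len(s)
```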
To be honest I've never implemented it, even for a demo, and not found a bug.
> To be honest I've never implemented it, even for a demo, and not found a bug.
Me too. I tend to roll my own property testers / fuzzers per project instead of using a library. But my experience is similar to yours. Out of maybe 25 testers, I think the only times I didn’t find any bugs was when I messed up the tests themselves.
It’s incredibly humbling work.
I used to build these too; my example was one in ActionScript (always fun on restricted devices), and it was a bug in a library I wrote. I was actually testing the property tester when I found the bug, after realizing it wasn't a bug in the new test/tester itself.
> It’s incredibly humbling work
Absolutely. Glad others feel it too.
I almost feel like you need a decent reason why not to use them in places.
Ideally you can describe the behaviour of something in general terms around what stays true. If you can't, is it too hard for people to understand?
I will however say I don't usually write pbts but have a strong feeling I should add more.
Very true. For me, fuzzers and property-based tests are two sides of the same coin. I'd just use whichever feels more natural.
Thinking about the difference, I wonder if there is a PBT framework that would exploit feedback like fuzzing does, to improve coverage.
The Choice Gradient Sampling algorithm shown in the paper could be used to integrate/exploit feedback (coverage, etc.).
Hypothesis can use a fuzzer as its input source, but it's marked as experimental.
It doesn't seem like "proper" fuzzers are used very often with Haskell (e.g. there's an example at https://www.tweag.io/blog/2023-06-15-ghc-libfuzzer/ ), though many people seem to have hooked up the HPC code coverage tool to property checkers, e.g.
https://www.cse.iitk.ac.in/users/karkare/MTP/2011-12/subhash...
https://www.doc.ic.ac.uk/~tora/irulan/issta.pdf
https://github.com/shapr/kudzu
How come e.g. Jane Street uses it so much? It’s the second most common type of test I write.
Jane Street uses OCaml, and property based tests are easiest when dealing with pure functions, and are usually taught in FP classes, so I assume it's that: easier to set up, and the right target audience.
Edit: also a numerical domain, which is the easiest type to use them for in my experience!
The same reason Google burns $50M+ in electricity each year using protobufs instead of a more efficient format. An individual company having specific needs isn't at odds with a general statement being broadly true.
How’s that comparable at all? There are no network effects from writing property based tests; people use them if they are helpful - i.e. if they test enough of the code with a reasonable amount of effort. Nobody’s forcing people to write tests, unlike Google, which forces usage of protobuf on all projects there.
It's comparable in the way described in sentence #2:
> An individual company having specific needs isn't at odds with a general statement being broadly true.
Google needs certain things more than reduced carbon emissions, and Jane Street needs certain things more than whatever else they could spend that dev time on.
Fine but cutting the thought process at "it depends" is not a great way to understand what's happening here. You can explain anything happening at any company by saying "they need certain things more than whatever else they could spend that time on".
Why is PBT useful at Jane Street, at least more than in other places? Is it the use of functional language? Average Jane Street dev being more familiar with PBT? Is the domain particularly suited to this style of testing?
Explicitly, my claim is that the biggest bottleneck is education on how to use PBT effectively, and Jane Street is not using it to get an extra mile in safety; they use it because it's the easiest way to write a large chunk of their tests.
>Why is PBT useful at Jane Street, at least more than in other places?
Because trading firms write a lot more algorithmic code than most businesses. Trading strategy code is intensely algorithmic and calculation heavy by its very nature as is a lot of the support code written around it.
At least, that's what it was like when I worked in a trading firm. Relatedly, it was one of the few projects I'd worked on where having 95% unit tests and 5% integration tests made perfect sense. It fitted the nature of the code, which wasn't typical of most businesses.
Somebody else wrote that they wrote a lot of numerical code in another business for which property testing is extremely useful and again, I don't doubt that either. 95% is still != 100% though.
Not to derail, but what’s more efficient in your view? We compared messagepack, standard http/json and protobufs for an internal service and protobufs came out tops on every measure we had.
The gold standard is a purpose-built protocol for each message, usually coming in ~20x faster and ~2-8x smaller than a comparable proto. (It's perhaps obvious why Google doesn't do this: the developer workload increases for every message even in a single language, it scales linearly with the number of languages you support, you lose the ability to shove most of the bugginess questions into a single shared library, and backwards compatibility is complicated with custom protocols -- they really do want you to be able to link against most g3 code without interop concerns.) I've had a lot of success in my career with custom protocols in performance-sensitive applications, and I wouldn't hesitate to do it again.
Barring that though, capnproto and flatbuffers (perhaps with compression on slow networks) are usually faster than protos. Other people have observed that performance deficit on many occasions and made smaller moderately general-purpose libraries before too (like SBE). They all have their own flavors of warts, but they're all often much faster for normal use cases than protos.
As a hybrid, each project defining its own (de)serializer library can work well too. I've done that a few times, and it's pretty easy to squeeze out 10x-20x throughput for the serialization features your project actually needs while still only writing the serialization crap once and reusing it for all your data types.
Recapping a few reasons why protos are slow:
- There's a data dependency built into the wire format which is very hard to work around. It blocks nearly all attempts at CPU pipelining and vectorization.
- Lengths are prefixed (and the data is variable-length), requiring you to (recursively) serialize a submessage before serializing its header -- forcing either copies or undersized syscalls.
- Fields are allowed to appear in any order, preventing any sort of code which might make the branch predictor happy.
- Some non-"zero-copy" protocols are still quite fast since you can get away with a single allocation. Since several decisions make walking the structure slow, that's way more expensive that it should be for protos, requiring either multiple (slow) walks or recursive allocations.
- The complexity of the format opens up protos to user error. Nonsense like using a 10-byte, slow-to-decode varint for the constant -1 instead of either 1, 4, or 8 fast-to-decode bytes (which _are_ supported by the wire format, but in the wild I see a lot of poorly suited proto specs); see the sketch at the end of this comment.
- The premise in the protocol that you'll decode the entire type exactly as the proto defines prevents a lot of downstream optimizations. If you want a shared data language (the `.proto` file), you have to modify that language to enforce, e.g., non-nullability constraints (you'd prefer to quickly short-circuit those as parse errors, but instead you need extra runtime logic to parse the parsed proto). You start having to trade off reusability for performance.
And so on. It's an elegant format that solves some real problems, but there are precious few cases where it's a top contender for performance (those cases tend to look like bulk data in some primitive type protos handle well, as opposed to arbitrary nesting of 1000 unrelated fields).
Specific languages might have (of course) failed to optimize other options so much that protos still win. It sounds like you're using golang, which I've not done much with (coming from other languages, I'm mildly surprised that messagepack didn't win any of your measurements), and by all means you should choose tools based on the data you have. My complaints are all about what the CPU is capable of for a given protocol, and how optimization looks from a systems language perspective.
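Re the -1 varint point above, a back-of-the-envelope sketch (plain Python, not the real protobuf library) of why the default int64 encoding is so bad for that constant:

```python
def varint(n: int) -> bytes:
    # Base-128 varint of an unsigned 64-bit value, least-significant group first.
    out = bytearray()
    while True:
        n, b = n >> 7, n & 0x7F
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def zigzag64(n: int) -> int:
    # sint64 zigzag mapping: small-magnitude negatives stay small.
    return (n << 1) ^ (n >> 63)

print(len(varint(-1 & (2**64 - 1))))  # int64 field: 10 bytes on the wire
print(len(varint(zigzag64(-1))))      # sint64 field: 1 byte
# sfixed32 / sfixed64 would always be 4 / 8 bytes.
```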
What does a 'purpose-built protocol for each message' look like? You avoid type/tagging overhead, but other than that I'd expect a ""sufficiently smart"" generic protocol to be able to achieve the same level of e.g. data layout optimization. Obviously ProtoBuf in particular is pessimising for the reasons you describe, but I'm thinking of other protocols (e.g. Flatbuffers, Cap'n Proto, etc.)
The problem is that "sufficiently smart" does a lot of heavy lifting.
One way to look at the problem is to go build a sufficiently smart generic protocol and write down everything that's challenging to support in v1. You have tradeoffs between size (slow for slow networks), data dependencies (slow for modern CPUs), lane segmentation (parallel processing vs cache-friendly single-core access vs code complexity), forward/backward compatibility, how much validation the protocol should do, .... Any specific data serialization problem usually has some outside knowledge you can use to remove or simplify a few of those "requirements," and knowledge of the surrounding system can further guide you to have efficient data representations on _both_ sides of the transfer. Code that's less general-purpose tends to have more opportunities for being small, fast, and explainable.
A common source of inefficiencies (protobuf is not unique in this) is the use of a schema language in any capacity as a blunt weapon to bludgeon the m x n problem between producers and consumers. The coding pattern of generating generic producers/consumers doesn't allow for fine-tuning of any producer/consumer pair.
Picking on flatbuffers as an example (I _like_ the project, but I'll ignore that sentiment for the moment), the vtable approach is smart and flexible, but it's poorly suited (compared to a full "parse" step) to data you intend to access frequently, especially when doing narrow operations. It's an overhead (one that reduces the ability for the CPU to pipeline your operations) you incur precisely by trying to define a generic format which many people can produce and consume, especially when the tech that produces that generic format is itself generic (operating on any valid schema file). Fully generic code is hard enough to make correct, much less fast, so in the aim of correctness and maintainability you usually compromise on speed somewhere.
For that (slightly vague) flatbuffers example, the "purpose-built protocol" could be as simple as almost anything else with a proper parse step. That might even be cap'n proto, though that also has problems in certain kinds of nested/repeated structures because of its arena allocation strategy (better than protobuf, but still more allocations and wasted space than you'd like).
Trading companies are unusual in writing a lot of algo-heavy code. Did you assume every company was like this?
I can assure you they aren't.
Even trading companies have a ton of projects and code of the kind you'll find at any reasonably sized tech company; the algo-heavy code is a small fraction of the total code they write. In this sense, they are not such an outlier just based on the business they are in - I think the use of a functional language, good tooling and education around PBT are much more important factors.
>the algo-heavy code is a small fraction of the total code they write
Wasn't the case in the trading firm I worked at.
Do you work at Jane Street? Have you worked elsewhere?
Just because a company uses something doesn't mean all companies should. May as well use monorepos in that case
I have to admit that after all these years the term "integration tests" is near meaningless to me. But property-based testing is absolutely real. If you can think of one example-based test, you can probably think of a way to generalise it. It's a powerful idea, much more powerful than trying to remember what exactly a "unit" means.
> My work is motivated by conversations with real PBT users, accentuating the benefits that they get from PBT and reducing the drawbacks.
ISTM that the benefits and drawbacks are just two sides of the same coin. The benefit is reduced testing effort. The drawback is that the savings come from sacrificing knowledge of exactly what has been tested.
> I also observed that PBT users are not always good at evaluating whether their testing was effective.
Ah, there's the rub. It seems that the benefit of making the tested property explicit is partially offset by not really knowing how much of the search space was explored automatically.
> The benefit is reduced testing effort.
I wouldn't agree with this assessment. PBT has frequently surprised me, finding bugs that I never would have thought to test. It's not merely a tool for generating the unit tests we would have written anyway; it's a way to write down how we think the parts of a system interact, and have the machine show us that we're wrong. Crucially, it's quite hard to prove ourselves wrong, since it's hard to think up scenarios that we hadn't considered, so it's nice to have the machine help with that part.
> The drawback is that savings comes from sacrificing knowledge of exactly what has been tested.
I encourage you to use the statistics-gathering features that many (most?) PBT systems provide (e.g. functions with names like "collect", "label", etc.). Also, whilst preconditions/filtering can be useful in a pinch, it's usually a good idea for a property to construct data of the appropriate sort itself (e.g. using arbitrary arguments to fill in an appropriate template); rather than relying on a completely arbitrary generator to hit scenarios you care about.
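For example, in Hypothesis a rough sketch of that labelling looks like the following (run pytest with --hypothesis-show-statistics to see the counts, if I remember the flag correctly):

```python
from hypothesis import event, given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs):
    # Label each generated case so the distribution shows up in the statistics.
    event("empty" if not xs
          else "has duplicates" if len(set(xs)) < len(xs)
          else "all distinct")
    assert sorted(sorted(xs)) == sorted(xs)
```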
I think the surprise part is under-appreciated. The quickcheck tests I write are frequently of the form “let’s spend 15 minutes and see what happens” kind of tests - and I almost always find bugs that way. The return on investment is bonkers.
I do think that immutable by default OCaml + good PBT tooling there helps a lot.
In the testing I focus on, testing of compilers, and in particular Common Lisp implementations, PBT has been invaluable, finding bugs in every implementation on which it was tried (I understand this is a universal experience in compiler testing, for example seen from Csmith on C compilers). Here, the property is often of the form "these two related pieces of code should not crash and also compute the same thing". As a simplified example, an expression like
(+ x y)
should compute the same thing as
(the <type1> (+ (the <type2> x) (the <type3> y)))
where the types are randomly generated types that contain the values in question.
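The same general shape of property (two related implementations that should not crash and should agree on everything) is easy to state in any PBT tool; here's a rough Python/Hypothesis sketch where a hand-written insertion sort stands in for the second implementation:

```python
from hypothesis import given, strategies as st

def insertion_sort(xs):
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

@given(st.lists(st.integers()))
def test_two_implementations_agree(xs):
    # Differential property: neither side crashes, and they compute the same thing.
    assert insertion_sort(xs) == sorted(xs)
```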
Nice work. I didn't yet read it fully, but I love the idea. Looks to be a valuable thesis.
I read that as "Property-Based Testing of People"
My main issue with that approach is that your tests become flaky due to the random data generation: they can pass in your staging build pipeline and then fail in your prod build pipeline. Extensive logging doesn't change that fact.
You can fix the PRNG seed in your build pipelines, which should make them execute the same examples for each test.
Hypothesis has a nice feature which uses a hash of each test's source code as the PRNG seed for that test, which ensures the same examples are checked every time (unless someone changes the text of that particular test). It's a nice idea, but might be hard to implement in compiled languages (like Haskell, for example).
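For comparison, a minimal sketch of pinning Hypothesis down per test (assuming I understand the setting correctly, derandomize derives the choices deterministically instead of drawing a fresh random seed each run):

```python
from hypothesis import given, settings, strategies as st

@settings(derandomize=True)
@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```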
LLMs are great at generating property-based tests.