nestorD 2 days ago

I have done my PhD on measuring floating-point error in large numerical simulations. In my experience, most of the time, if 64-bit floating points are not enough, then trying to increase precision further is unlikely to help much (there are exceptions).

In general, a well-placed compensated (or exact) sum/dot-product algorithm will solve most of your problems at a low cost. Nowadays, most people are familiar with Kahan summation (and related tricks like sorting the numbers), but they do not realize that it often helps less than moving to 128-bit floats, and that the state of the art for these algorithms has moved on quite a bit since. Today, you can ensure that your sum/BLAS operation produces a result within one ulp of the exact result!
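
For a concrete picture, here is a minimal Python sketch of a compensated (Kahan-Neumaier) sum, with the standard library's math.fsum as an exact, correctly rounded reference (toy data, not a benchmark):

    import math
    import random

    def neumaier_sum(values):
        # Compensated (Kahan-Neumaier) summation: carry the rounding error
        # of each addition in a separate compensation variable.
        total = 0.0
        comp = 0.0
        for v in values:
            t = total + v
            if abs(total) >= abs(v):
                comp += (total - t) + v   # low-order bits lost from v
            else:
                comp += (v - t) + total   # low-order bits lost from total
            total = t
        return total + comp

    xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)] + [1e-6]
    print(sum(xs))            # plain summation
    print(neumaier_sum(xs))   # compensated summation
    print(math.fsum(xs))      # exact (correctly rounded) stdlib reference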

I highly recommend perusing the Handbook of Floating-Point Arithmetic (the Accurate Algorithm online page also seems fine [0]) to get an idea of what can be done. Implementations of these ideas can be found in most languages used for numerical computation (a quick Google search gave me options for C++/Python [1], Rust [2], etc.).

[0]: https://accurate-algorithms.readthedocs.io/en/latest/index.h...

[1]: https://github.com/yafshar/xsum

[2]: https://github.com/bsteinb/accurate

  • credit_guy 7 hours ago

    Here's an example where higher precision could be useful. You need to compute the expectation of some function over a distribution, and you only have the moment generating function (MGF) for the distribution. The moments are the derivatives of this MGF, but a rough rule of thumb is that if the precision of a function is epsilon, then the precision of its n-th derivative is epsilon^(1/n). If you have 12 significant digits for the MGF, you can count on only 1 significant digit for the 12th derivative. Let's say that is simply unacceptable for you: you need 4 digits, so you can use only the first 3 derivatives. If you double the precision, you can use 6 derivatives, so you can match 6 moments. This might be enough for your problem. Of course, you can find better algorithms: use cumulants instead of moments, find a way to devise a Gaussian quadrature method (a Golub-Welsch algorithm) that uses these cumulants, etc. But that requires doing a lot of hard math (with the inevitable initial bugs), while plugging in a higher-precision library gives you instantaneous results.
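
    A quick back-of-the-envelope check of that rule of thumb (toy numbers, not the actual MGF workflow):

        # Rule of thumb from above: an n-th derivative of a function known to
        # precision eps carries roughly eps**(1/n) relative error.
        for eps in (1e-12, 1e-24):   # ~12 digits (double-ish) vs. ~24 digits (doubled precision)
            usable = max(n for n in range(1, 30) if eps ** (1.0 / n) <= 1e-4)
            print(f"eps = {eps:g}: about {usable} usable derivatives at 4 digits")
        # prints 3 and 6, matching the argument above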

    • nestorD an hour ago

      Yes! And no!

      You will indeed need higher precision if you need a very large number of digits (more than what double precision gives you).

      However, even with high precision, the algorithms you use to compute your moments matter: you might not have as many digits as you hope for. Variance, for example, is famously numerically unstable (sometimes giving a negative variance) when computed naively. A stable summation algorithm plugged into a variance computation (even a carefully written one) can sometimes give you six additional digits of precision in the variance that you might not get when doubling the precision (it depends on how big of a cancellation you had in your operations; too big, and doubling the precision will not be enough to recover, while a compensated summation would do fine).

      I have seen this play out once in a clustering algorithm that would only work on some data when using a compensated summation to compute the variance.
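
      A toy illustration of that instability (illustrative numbers, nothing to do with the clustering case above): the textbook one-pass formula E[x^2] - E[x]^2 cancels catastrophically once the mean dwarfs the spread, while a two-pass computation with an exact accumulator (math.fsum here) stays sane.

          import math
          import numpy as np

          rng = np.random.default_rng(0)
          x = 1e8 + rng.standard_normal(100_000) * 1e-3   # huge mean, tiny spread

          n = len(x)
          naive = sum(v * v for v in x) / n - (sum(x) / n) ** 2      # one-pass textbook formula
          mean = math.fsum(x) / n
          twopass = math.fsum((v - mean) ** 2 for v in x) / n        # stable two-pass formula

          print(naive)              # garbage, possibly negative: massive cancellation
          print(twopass)            # ~1e-6, as expected
          print(float(np.var(x)))   # numpy's own two-pass variance, for reference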

  • PaulHoule 2 days ago

    That’s what I was taught about numerics doing my PhD in the 1990s. I was taught that you shouldn’t even use doubles if you could use singles.

  • sebzim4500 2 days ago

    That's also my experience.

    It's interesting that 64 bits are often necessary (as opposed to 32) but 128 bits so rarely are.

    • nestorD 2 days ago

      Yes! I suspect that it boils down to 32 bits not being that much (you can enumerate them to test a function), while 64 bits is a lot. By that point, if you are having problems, then you are likely hitting a wall that will not be crossed by throwing additional bits at the problem.
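
      To make the "you can enumerate them" point concrete, here is a toy exhaustive check over all 2^32 float32 bit patterns (testing float32 sqrt against a float64 reference is just an example; the loop takes a few minutes):

          import numpy as np

          mismatches = 0
          for start in range(0, 2**32, 2**24):                         # 256 chunks of 16M bit patterns
              bits = np.arange(start, start + 2**24, dtype=np.uint64).astype(np.uint32)
              x = bits.view(np.float32)
              x = x[np.isfinite(x) & (x >= 0)]                         # keep valid sqrt inputs
              ref = np.sqrt(x.astype(np.float64)).astype(np.float32)   # reference, rounded once
              mismatches += int(np.count_nonzero(np.sqrt(x) != ref))
          print(mismatches)   # every representable float32 input was checked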

  • KolenCh 2 days ago

    Thanks, it is very helpful. 2 questions:

    1. What’s the implications when parallelization is considered? I.e. is those accurate sum algorithms parallelizable, and how would it compares to traditional parallelized sum? 2. What is the performance penalty I should be expecting?

    • nestorD 2 days ago

      1. Parallelization: Yes, they do parallelize! Conceptually, it is fairly easy to use accurate algorithms locally, then reduce the local accumulators (sketched below). A side effect of doing this is cutting (or reducing) the noise in the output introduced by the parallelization (note that this is not to be confused with deterministic summation/BLAS, which enforces the order of operations to achieve the same side effect but does not improve accuracy).

      2. It varies wildly depending on the algorithm you are using, the implementation of said algorithm, and the operations you are performing outside of the summation (if you sum the results of a loop performing non-trivial operations, that work can easily overlap with the summation overhead, making it essentially free). The only rule of thumb I can give is that runtime cost increases with the desired accuracy. My recommendation would be to switch from a naive sum to an exact accumulator and observe the difference in output value (letting you know how impactful numerical errors in that sum were) and in runtime (letting you know how expensive the switch is for you), then dial it down if needed.
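
      A minimal Python sketch of the chunked pattern, using math.fsum as the exact local accumulator (serial here, but each chunk could just as well be a thread, process, or MPI rank; the result then depends only on how the data is chunked, not on the order in which chunks finish):

          import math
          import numpy as np

          x = np.random.default_rng(0).standard_normal(1_000_000)

          # Accurate local accumulators (one per worker), then reduce them.
          partials = [math.fsum(chunk) for chunk in np.array_split(x, 8)]
          total = math.fsum(partials)

          print(total)             # chunked exact-accumulator sum
          print(float(np.sum(x)))  # numpy's default pairwise sum, for comparison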

      • KolenCh 2 days ago

        Thanks! That’s very interesting! Now I need to find my next project to try this…

Conscat 2 days ago

Does anyone have examples of calculations that need quadruple precision? My last workplace seemed fine using double precision for simulating satellites in orbit, and I have the impression double precision makes black hole and quantum physics simulations work correctly, so I have no intuition for this.

  • gh02t 2 days ago

    I see it in statistics. It's not uncommon in a lot of statistical algorithms to need to take the product (or sum) of a lot of very small numbers and then divide by that product when computing likelihoods of a bunch of independent measurements. There are some algebra tricks you can use to get around loss of precision like working in the log domain or some tedious scaling tricks, but working in quad precision can help when those aren't enough.
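
    A toy version of that failure mode and the log-domain workaround (illustrative numbers, not a real likelihood):

        import numpy as np

        p = np.full(500, 1e-20)      # 500 tiny independent likelihoods
        print(np.prod(p))            # 0.0: underflows long before the end
        print(np.sum(np.log(p)))     # ~-23025.85: in the log domain, products become sums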

    The article also gives an example from quantum physics. Some chaotic or highly numerically unstable problems just need quad precision to not blow up; primarily problems/algorithms (it can be the problem itself or the algorithm used to solve it that causes the issues) that tend to catastrophically amplify small errors. Matrix inversion is another example, to the point that most algorithms try to do almost anything to avoid explicitly inverting a matrix. But sometimes you just have to.

    • Dylan16807 2 days ago

      To put some specifics on "a lot of very small numbers", it's around a million numbers with part-per-billion precision, or a billion numbers with part-per-million precision, where you approach the limits of naive summation in double precision.
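
      Those thresholds follow from the usual worst-case bound of roughly n*u for a naive length-n sum, with u the unit roundoff of a double:

          import numpy as np

          u = np.finfo(np.float64).eps / 2   # unit roundoff of double, ~1.1e-16
          print(1e6 * u)   # ~1.1e-10: around part-per-billion territory for a million terms
          print(1e9 * u)   # ~1.1e-7:  around part-per-million territory for a billion terms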

      • gh02t 2 days ago

        For sums, yes, but products and division by small numbers are the real killer and those come up a lot in stats. Hence why you try to avoid them by working with log-probabilities (where products become sums), but sometimes you can't. Quad precision is a bit of a bandaid that just pushes the issue off when you don't have a better algorithm, but it works sometimes.

        • belkala a day ago

          How so? All primitive operations are backward stable, and unlike addition and subtraction, division and multiplication are well-conditioned.

    • slashdave 2 days ago

      Matrix inversion precision can be fixed iteratively (at a CPU cost, of course). Nearly singular matrices should probably be factored in any case.
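
      For a flavor of the iterative fix-up, here is a toy sketch of iterative refinement on a linear solve, with the residual computed in np.longdouble (on platforms where long double is just double, the gain mostly disappears; the ill-conditioned matrix is purely illustrative):

          import numpy as np

          rng = np.random.default_rng(0)
          n = 200
          U, _ = np.linalg.qr(rng.standard_normal((n, n)))
          V, _ = np.linalg.qr(rng.standard_normal((n, n)))
          A = U @ np.diag(np.logspace(0, -10, n)) @ V     # condition number ~1e10
          x_true = rng.standard_normal(n)
          b = A @ x_true

          x = np.linalg.solve(A, b)
          print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # roughly cond(A) * eps

          Ald, bld = A.astype(np.longdouble), b.astype(np.longdouble)
          for _ in range(3):
              r = (bld - Ald @ x).astype(np.float64)      # residual in extended precision
              x = x + np.linalg.solve(A, r)               # correction (in practice, reuse the LU factors)
          print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # several digits better where long double is 80-bit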

      • gh02t 2 days ago

        Probably, that falls into "doing almost anything to avoid inverting a matrix explicitly", but direct inversion is the classic example here.

        • kolbe 2 days ago

          For numerical PDE solvers, you never do the actual inversion, but the discretization of the system (denoted A) can have significant sensitivity to numerical precision. Look up Condition Number and Preconditioning. In fact, resolving the problems with numerical precision for some A takes as much time as actually solving the system, and it's worth it.

          • gh02t 2 days ago

            Oh I'm very aware, I'm lumping those all into "tricks used to avoid/work around the numerical issues associated with matrix inversion" for simplicity, because explicit computation of a matrix inverse is one of the classic examples of a numerically unstable task. Hence why a large subfield of applied math can be summarized as "coming up with ways to avoid matrix inversion." PDE solvers like you mention are one of the main applications for that.

            Tricks like clever factorization (though a lot of factorization algorithms have their own severe numerical issues, e.g. some of the ways to compute QR or SVD), preconditioners, sparse and/or iterative algorithms like GMRES, randomized algorithms (my favorite), etc. are all workarounds you wouldn't need if there were a fast and stable way to exactly invert any arbitrary non-singular matrix. Well, you would have less need for them; there are other benefits to those methods, but improving numerical stability by avoiding inverses is one of the main ones.

  • gmiller123456 2 days ago

    One situation that already uses two doubles, which you may already be familiar with, is the International Astronomical Union's Standards of Fundamental Astronomy (SOFA) library. It uses two doubles for the date in order to have pico/nanosecond precision over thousands of years. Of course the algorithms aren't precise to that level over thousands of years, but you still need the ability to specify the date to that level during the time periods for which they are that accurate.
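
    Rough numbers behind that (a back-of-the-envelope check, not the SOFA code itself): a single double holding a full Julian date resolves time to tens of microseconds at best, while the fractional part kept in a second double resolves it to picoseconds.

        import numpy as np

        jd = 2460000.5 + 0.12345          # full Julian date crammed into one double
        print(np.spacing(jd) * 86400)     # ~8e-5 s: microsecond-level resolution at best

        jd1, jd2 = 2460000.5, 0.12345     # two-double split, SOFA-style
        print(np.spacing(jd2) * 86400)    # ~1e-12 s: picosecond-level resolution in the fraction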

    • mitthrowaway2 2 days ago

      Interesting. I would have thought that fixed-point would make more sense for a time quantity. If you use a 64-bit integer to represent picoseconds, it gives you a time range of about 18 million years.

      • Dylan16807 2 days ago

        You calculated something wrong. 2^64 picoseconds is only half a year.

        Once you decide you need to go over 64 bits, the exact data type is largely a matter of convenience, and I can easily see two doubles being more convenient than either a quad or a 128-bit integer. You have more than 100 bits of precision in such a scheme, and it runs pretty fast.
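
        Quick sanity check on both of those numbers:

            print(2**64 * 1e-12 / 86400 / 365.25)   # 2**64 picoseconds is ~0.58 years
            print(2 * 53)                           # two doubles: ~106 significand bits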

        • mitthrowaway2 2 days ago

          Ah good catch, my bad! 18 million seconds, not 18 million years. Oops =)

  • mitthrowaway2 2 days ago

    It's a very quick and easy / lazy way to measure the floating-point error in your existing double-precision algorithms; run the computation at quadruple precision and check for differences in the output. If there's any significant difference then you take a closer look.
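
    A tiny example of that workflow, with mpmath at 34 digits standing in for the higher-precision run: evaluate the same expression both ways and diff the results.

        import numpy as np
        from mpmath import mp, mpf

        def f(x, cos):
            return (1 - cos(x)) / x**2    # catastrophically cancels for tiny x

        x = 1e-8
        print(f(x, np.cos))               # 0.0 in double precision

        mp.dps = 34                       # ~quad precision as the reference
        print(f(mpf(x), mp.cos))          # ~0.5: the double result was completely wrong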

  • physicsguy 2 days ago

    You can easily get catastrophic cancellation in physical simulations, people are just not very aware of it.

    If, for example, you see bare summation rather than some sort of sorting by magnitude, then expect bad results as the numbers get bigger.

    • bee_rider 2 days ago

      It is a bit weird though, because people also get very, very clever about designing their physical simulations so that they don’t have to care too much about their working precision.

      • physicsguy 2 days ago

        Yeah, GPUs actually helped with this a lot when they first came out, because suddenly, to get the great performance, you had to drop to single precision. That was true at least until 2018 or so (can't remember exactly now, but it used to be the case that the older cards were 2x as slow at DP).

        • bee_rider 2 days ago

          I don’t really do GPU stuff, but I’d expect double to be at most half as fast as single, even on good hardware. I mean, it is twice as many bits. (Maybe you think in bits/second, though, which is totally fair.)

          I looked up my consumer card on a whim (playing with sprinkling some GPGPU into my code—cupy is actually really easy to use and ergonomic!), and apparently doubles are 1/32 as fast as singles. Yikes!

  • colechristensen 2 days ago

    Any long-running calculation where the new state depends on the old state and there isn't much of a correction factor. Anything with "sensitive dependence on initial conditions", any chaotic dynamical system.

    For example: simulating the three-body problem (three celestial bodies of roughly equal size orbiting each other). Very small differences get amplified, and any difference at all in how the math is implemented on different systems will result in divergent solutions.
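
    Even a toy system shows it (the logistic map below is just a stand-in for a real simulation): perturb the initial condition in the last couple of bits and the two trajectories share no digits after a few dozen steps.

        x, y = 0.4, 0.4 + 2**-52        # initial conditions differing by ~2e-16
        for step in range(60):
            x, y = 4 * x * (1 - x), 4 * y * (1 - y)   # chaotic logistic map
        print(x, y)                     # completely different trajectories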

  • gnulinux 2 days ago

    I use it for generating music through physical models. If the model requires a very numeric-sensitive algorithm like a differential equation solver, it's nice to have the extra precision, and usually performance doesn't matter too much in these sort of things (because ultimately, I just click "render" and wait anyway). To be very honest, I don't even know if I need 128bits, maybe a 32bit float would do... I never ran a comparative analysis. It just gives me more room for error, so that I can focus on the music/sound and not too much on the numerical analysis/programming. Working on hobby music+programming projects is the ultimate freedom.

    • Dylan16807 2 days ago

      I'm glad you're having fun but if you never compared it to double and single then you're not giving a valid example. Especially if you would have picked even higher if it was available.

    • sfpotter 2 days ago

      Basically guaranteed that you don't need quad for this.

    • a1369209993 2 days ago

      Of course, you also need to use solid gold audio cables, or you're just throwing away the extra precision via poor signal fidelity.

sfpotter 2 days ago

The only real place for quad precision is when you're working with numbers that truly require you to be able to span a massive difference in magnitudes without round-off error accumulating.

For some context, an electron "has structure" roughly on a length scale of about 10^-15 meters (details here unimportant). If you're representing a quantity in O(1) meters, then doing a handful of floating-point operations in double will likely accumulate error on the order of O(1) electrons. Incredible accuracy!
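
Putting a number on that:

    import numpy as np
    print(np.spacing(1.0))   # ~2.2e-16: absolute resolution of a double near 1 (meters here)
    # A handful of roundings of this size accumulates to ~1e-15 m,
    # i.e. roughly the electron length scale quoted above.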

I think the key here is to be mindful of units and make sure you're combining like with like and scaling things appropriately as you go. If you're doing things the wrong way, quad will punish you eventually, too.

An underappreciated tool for working more confidently with floats is interval arithmetic. Worth checking out if you actually (truly) need to do things robustly (most people don't).
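
If you want a quick taste, mpmath ships a basic interval type (one option among several):

    from mpmath import iv

    iv.dps = 15
    x = iv.mpf('0.1')    # an interval guaranteed to contain the real number 0.1
    print(x * 10 - 1)    # a tiny interval straddling 0: the error is bounded, not guessed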

  • andrepd 2 days ago

    That's one of the (many) problems with IEEE floats: they "waste" as many bits on mantissa precision in the range of 10^0, 10^1, 10^2, etc., as they do around 10^300 or 10^-300!

    Proposals like Posits aim to improve on this :)

    • nestorD 2 days ago

      In practice, Posits grab very few additional bits, and those are mostly felt in low-bit (16 bits and fewer) regimes. That's a big reason they are mostly seeing adoption in memory-constrained settings.

      A great thing in the Posit proposal is that they also consider a distinct accumulator type for things like sums and dot products. In my experience (I did a PhD on measuring floating-point error in large numerical simulations), those operations are the ones most likely to sneak up on you and need fixing. But most people don't realize that those accumulator types are actually easy to use with IEEE floats; they are entirely orthogonal to using Posits.

    • sfpotter 2 days ago

      IEEE floats are an incredible feat of engineering. The people who designed them were geniuses. We should be appreciative of the fact that we have a consistent and portable way to do numerics. I'm glad we've got IEEE 754 and don't have to suffer through the pain of numerous competing standards.

awoimbee 2 days ago

> The long double type varies dramatically across platforms:
> [...]
> What sets numpy_quaddtype apart is its dual-backend approach:
> Long Double: This backend uses the native long double type, which can offer up to 80-bit precision on some systems allowing backwards compatibility with np.longdouble.

Why introduce a new alias that might be 128 bits but also 80? IMO the world should focus on well-defined types (f8, f16, f32, f64, f80, f128), then maybe add aliases.

  • bee_rider 2 days ago

    Maybe it depends on the application?

    If you have written a linear system solver, you might prefer to express yourself in terms of single/double precision. The user is responsible for knowing if their matrix can be represented in single precision (whatever that means, it is their matrix and their hardware after all), how well conditioned it is, all that stuff. You might rather care that you are working in single precision, and that there exists a double precision (with, it is assumed, hardware support, because GPUs can’t hurt us if we pretend they don’t exist) to do iterative refinement in if you need to.

  • 0cf8612b2e1e 2 days ago

    Really seems like propagating the current platform inconsistencies into the future. Stick with 128 always, performance be damned. Slow code is much preferable to subtly broken code because you switched the host OS.

    • gnulinux 2 days ago

      Especially if you need 128-bit float precision. It's very well known and understood that quad floats are much slower on most platforms, and extremely slow on some. If you're using quad floats, it's because you absolutely need all 128 bits, so why even reduce it to 80 bits for "performance"? Seems like an irrelevant design choice. The programmer can choose between f128 and f80 if performance is intractable on the target platform.

jabl 2 days ago

As an aside, the GNU Fortran compiler (gfortran) also provides a quad precision floating point type, similar to this one. If the target supports native quad precision it uses that, otherwise it falls back to software emulation.

Math functions are provided via libquadmath. Newer glibc also provides quad-precision functions with slightly different names, IIRC using an f128 suffix rather than the "q" suffix libquadmath uses.

  • slashdave 2 days ago

    GNU in general (C++, for example).

  • Archit3ch 2 days ago

    > Math functions are provided via libquadmath

    Notably lacking support for ARM.

JanisErdmanis 2 days ago

Unlikely to be a user here, just curious. If one were willing to use numpy_quaddtype with SciPy or other libraries that depend on NumPy, would that be possible without adding this library as a dependency to them?

jofer 2 days ago

I am very happy to see that the article _immediately_ explains how this is different from long-existing np.float128/np.longdouble.

I was scratching my head for a bit there... It's also good to see that custom dtypes have come a long way. This is a great application of a custom dtype, and that wouldn't have been possible a few years back (well, okay, a "few" to me is probably like 8 years, but still).

OutOfHere 2 days ago

Is there something here for people to use?

Can the comparison also be done using the max precision unsigned integers instead of just float64?

  • slashdave 2 days ago

    In my experience, it can be helpful for quickly fixing precision issues. Sometimes you just want to get on with your calculations and not go back and reformulate your numerical weak points. It's less necessary with good practice.

kolbe 2 days ago

Until my hardware implements f128, even if I had a solid use for it, I cannot sacrifice the speed relative to f64.

Archit3ch 2 days ago

Should be fun for audio circuit simulations!