Skip to content

BUG: Moore-Penrose Pseudoinverse numerical stability/consistency affected by recent OpenBLAS activity #5866

Description

@tylerjereddy

I noticed a test failure downstream where a machine learning estimator was performing a partial_fit that leverages the Moore-Penrose pseudoinverse for a closed form solution (lanl/GFDL#117). Although only 1/100 elements mismatched, the relative tolerance diff was quite substantial on that element (1.46476215) vs. previous NumPy/OpenBLAS combinations. This failure showed up starting with NumPy 2.5.0 from 3 days ago, which shipped with a newer version of OpenBLAS.

I then dissected out a plain text, but rather lengthy, reproducer on the NumPy level. I've pushed it up to GitHub at https://github.com/tylerjereddy/pinv_issue, and it prints out a single element of np.pinv output from a fixed 2D input. The input is a (25, 25) array carefully dissected out of our in house testsuite--the size of that array in plain text is the main reason I've provided it in a separate repo--it is perhaps too large to dump here and so you can confirm it is plain text, etc.

Below are the results it prints in a few different scenarios -- the main problem with consistency appears to be with the OpenBLAS that ships with the latest x86_64 NumPy Linux wheels, since building NumPy 2.5.0 from source with an older OpenBLAS passes our testing just fine. I realize that these are just "proxy indicators" for the problem rather than the full picture, especially since different values on different platforms and with different linalg backends are acceptable, but nonetheless, the outcome on NumPy 2.5.0 with OpenBLAS from wheel is not acceptable in the broader testing, so I tried to cut this down to a proxy reproducer. Just to emphasize here--it is "ok" that some of the values in the table below are different, but the 0.3609... value with NumPy 2.5.0 + the OpenBLAS that ships with that wheel, is not ok (though this may only be part of the total numerical instability story). It is likely that the input array is pathological in a variety of ways, and this is not uncommon for batched input over a design matrix. Nonetheless, most other platforms and linalg backends and older OpenBLAS versions are doing just fine here in the broader numerical testing we do, and I wanted to keep the proxy test focused on a single float value for now.

NumPy version Result
NumPy 2.5.0 wheel on x86_64 Linux 0.360954047299216
NumPy 2.5.0 built from source on x86_64 Linux with OpenBLAS 0.3.20 0.5567409444424644 (passes our broader numerical testing)
NumPy 2.4.6 wheel on x86_64 Linux 0.08970465208832239
NumPy 2.3.4 wheel on x86_64 Linux 0.08970465208832239
NumPy 2.5.0 wheel on M series Mac 0.1422214449163197
NumPy 2.4.6 wheel on M series Mac 0.1422214449163197

I then talked to @seberg and @ngoldbaum and they suggested I bisect on OpenBLAS and open an issue here, so I bisected OpenBLAS alongside NumPy 2.5.0 to find the first point where x86_64 Linux + OpenBLAS produces the problem value above with NumPy 2.5.0 (the 0.3609... value) from wheel install (see below) when linked to NumPy. In fact, I ended up needing to use LD_PRELOAD to get the right behavior (force NumPy to link the correct source-built OpenBLAS, pkg config env vars/settings were not sufficient on their own...). I realize that this isn't a reproducer at the C level for OpenBLAS, and that these single proxy floats are just part of the story, but thought I'd get this started. I can work toward that if you really need it.

To further complicate matters, some of the commits in the bisection had to be skipped because they failed to build; presumably those commits really are broken for other reasons. So, the best I could do was this output that narrows it down to two possible problem commits

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
f5f789fc529ffcc4fb7dd5f525fee1e719339a3a
d9bb8f369f5d80cb0b074f01bde156a6dda1f193
We cannot bisect more!

Those both reference Reference-LAPACK PRs. I did confirm that the latest develop commit (a36e22c) is still broken (produces 0.360954047299216 using the reproducer above on x86_64 Linux).

Sorry if this is a little tricky to follow, but it is the result of several hours of careful dissection and I did not use AI--this is a real problem affecting our estimator research at DOE, and we're happy to help out if we can, but would probably need a bit more guidance. I wonder if an expert can look at the suspicious commits with a knowledge of the ops used by pinv and get a pretty good sense for where the issue is.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions