BUG: Moore-Penrose Pseudoinverse numerical stability/consistency affected by recent OpenBLAS activity

I noticed a test failure downstream where a machine learning estimator was performing a `partial_fit` that leverages the Moore-Penrose pseudoinverse for a closed form solution (https://github.com/lanl/GFDL/issues/117). Although only 1/100  elements mismatched, the relative tolerance diff was quite substantial on that element (`1.46476215`) vs. previous NumPy/OpenBLAS combinations. This failure showed up starting with NumPy `2.5.0` from 3 days ago, which shipped with a newer version of OpenBLAS.

I then dissected out a plain text, but rather lengthy, reproducer on the NumPy level. I've pushed it up to GitHub at https://github.com/tylerjereddy/pinv_issue, and it prints out a single element of `np.pinv` output from a fixed 2D input. The input is a `(25, 25)` array carefully dissected out of our in house testsuite--the size of that array in plain text is the main reason I've provided it in a separate repo--it is perhaps too large to dump here and so you can confirm it is plain text, etc.

Below are the results it prints in a few different scenarios -- the main problem with consistency appears to be with the OpenBLAS that ships with the latest x86_64 NumPy Linux wheels, since building NumPy `2.5.0` from source with an older OpenBLAS passes our testing just fine. I realize that these are just "proxy indicators" for the problem rather than the full picture, especially since different values on different platforms and with different linalg backends *are* acceptable, but nonetheless, the outcome on NumPy `2.5.0` with OpenBLAS from wheel is not acceptable in the broader testing, so I tried to cut this down to a proxy reproducer. Just to emphasize here--it is "ok" that some of the values in the table below are different, but the `0.3609...` value with NumPy `2.5.0` + the OpenBLAS that ships with that wheel, is *not* ok (though this may only be part of the total numerical instability story). It is likely that the input array is pathological in a variety of ways, and this is not uncommon for batched input over a design matrix. Nonetheless, most other platforms and linalg backends and older OpenBLAS versions are doing just fine here in the broader numerical testing we do, and I wanted to keep the proxy test focused on a single float value for now.

| NumPy version  | Result |
| ------------- | ------------- |
| NumPy `2.5.0` wheel on x86_64 Linux | `0.360954047299216` |
| NumPy `2.5.0` built from source on x86_64 Linux with OpenBLAS `0.3.20` |  `0.5567409444424644` (passes our broader numerical testing)  |
| NumPy `2.4.6` wheel on x86_64 Linux | `0.08970465208832239` |
| NumPy `2.3.4` wheel on x86_64 Linux|  `0.08970465208832239`|
| NumPy `2.5.0` wheel on M series Mac | `0.1422214449163197` |
| NumPy `2.4.6` wheel on M series Mac | `0.1422214449163197` |

I then talked to @seberg and @ngoldbaum and they suggested I bisect on OpenBLAS and open an issue here, so I bisected OpenBLAS alongside NumPy `2.5.0` to find the first point where x86_64 Linux + OpenBLAS produces the problem value above with NumPy `2.5.0` (the `0.3609...` value) from wheel install (see below) when linked to NumPy. In fact, I ended up needing to use `LD_PRELOAD` to get the right behavior (force NumPy to link the correct source-built OpenBLAS, pkg config env vars/settings were not sufficient on their own...). I realize that this isn't a reproducer at the C level for OpenBLAS, and that these single proxy floats are just part of the story, but thought I'd get this started. I can work toward that if you really need it.

To further complicate matters, some of the commits in the bisection had to be skipped because they failed to build; presumably those commits really are broken for other reasons. So, the best I could do was this output that narrows it down to two possible problem commits
 
```
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
f5f789fc529ffcc4fb7dd5f525fee1e719339a3a
d9bb8f369f5d80cb0b074f01bde156a6dda1f193
We cannot bisect more!
```

Those both reference Reference-LAPACK PRs. I did confirm that the latest `develop` commit (a36e22cbb252c) is still broken (produces `0.360954047299216` using the reproducer above on x86_64 Linux).

Sorry if this is a little tricky to follow, but it is the result of several hours of careful dissection and I did not use AI--this is a real problem affecting our estimator research at DOE, and we're happy to help out if we can, but would probably need a bit more guidance. I wonder if an expert can look at the suspicious commits with a knowledge of the ops used by `pinv` and get a pretty good sense for where the issue is.

NumPy version	Result
NumPy `2.5.0` wheel on x86_64 Linux	`0.360954047299216`
NumPy `2.5.0` built from source on x86_64 Linux with OpenBLAS `0.3.20`	`0.5567409444424644` (passes our broader numerical testing)
NumPy `2.4.6` wheel on x86_64 Linux	`0.08970465208832239`
NumPy `2.3.4` wheel on x86_64 Linux	`0.08970465208832239`
NumPy `2.5.0` wheel on M series Mac	`0.1422214449163197`
NumPy `2.4.6` wheel on M series Mac	`0.1422214449163197`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: Moore-Penrose Pseudoinverse numerical stability/consistency affected by recent OpenBLAS activity #5866

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

BUG: Moore-Penrose Pseudoinverse numerical stability/consistency affected by recent OpenBLAS activity #5866

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions