This is great; a lot of the performance in BLAS comes from memory management. It doesn't solve my problem, though. Sometimes you want to manually control memory, and Julia does not make that easy. In particular, when interfacing with BLAS/LAPACK through Julia, memory management is quite ugly, mainly due to the need to allocate work arrays, and a pure Julia implementation of BLAS is unlikely to fix this. I don't think writing eg ARPACK performantly in Julia is impossible, just painful to the point that writing it in FORTRAN starts to make sense (to me).
> I don't think writing eg ARPACK performantly in Julia is impossible
Not sure what you mean; ARPACK just wraps a bunch of calls to LAPACK and BLAS, that's all. It does not have any low-level linear-algebra kernels of its own. Also, its main bottleneck is typically outside the algorithm itself, namely the matrix-vector products. ArnoldiMethod.jl is pretty much a pure Julia implementation of ARPACK with no dependency on LAPACK (and on BLAS only when used with Float32/Float64/ComplexF32/ComplexF64). Finally, note that you can easily beat ARPACK by adding GPU support, since it doesn't provide any.
The person I was responding to seemed to have read my comment and thought I had an issue with raw performance. My point above is that iterative code making many LAPACK calls is difficult to write in Julia, because there is no way to manage memory in that situation other than ccall-ing everything yourself, at which point I would rather write FORTRAN. I work on eigenvalue solvers, so it is all more or less just wrapping a bunch of calls to BLAS/LAPACK. As you say, the main bottleneck is in the BLAS calls anyway, but the excess allocation in Julia can really slow you down. ARPACK is maybe a bad example because it's all just mat-vec; when you need to compute, say, an SVD every iteration, that's where you run into issues.
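To make the "ccall everything yourself" pain concrete, here is a rough sketch of what the preallocated-SVD inner loop looks like: query LAPACK's `dgesvd` once for its optimal work-array size, then reuse that buffer every iteration. This assumes Julia ≥ 1.7 (where LAPACK is reached through `libblastrampoline`); the wrapper name and loop structure are made up for illustration, not anyone's actual solver.

```julia
# Sketch only: direct ccall to LAPACK's dgesvd with the work array
# allocated once (via a workspace query) and reused across iterations.
using LinearAlgebra
using LinearAlgebra: BlasInt
using LinearAlgebra.BLAS: @blasfunc, libblastrampoline

# Illustrative wrapper: computes the full SVD of A in the caller's buffers.
# NOTE: dgesvd overwrites A, and with lwork = -1 it only performs the
# workspace query, writing the optimal size into work[1].
function dgesvd!(A::Matrix{Float64}, S, U, VT, work, lwork)
    m, n = size(A)
    info = Ref{BlasInt}(0)
    ccall((@blasfunc(dgesvd_), libblastrampoline), Cvoid,
          (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
           Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ptr{Float64},
           Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64},
           Ref{BlasInt}, Ref{BlasInt}, Clong, Clong),
          UInt8('A'), UInt8('A'), m, n, A, max(1, m), S, U, max(1, m),
          VT, max(1, n), work, lwork, info, 1, 1)
    info[] == 0 || error("dgesvd failed with info = $(info[])")
    return S, U, VT
end

m, n = 100, 50
A  = rand(m, n)
S  = Vector{Float64}(undef, min(m, n))
U  = Matrix{Float64}(undef, m, m)
VT = Matrix{Float64}(undef, n, n)

# Workspace query: lwork = -1 makes dgesvd report the optimal work size.
query = Vector{Float64}(undef, 1)
dgesvd!(copy(A), S, U, VT, query, BlasInt(-1))
work = Vector{Float64}(undef, BlasInt(query[1]))

# Inner loop: dgesvd destroys its input, so copy A into a reused buffer;
# beyond that, no per-iteration allocation.
Abuf = similar(A)
for iter in 1:10
    copyto!(Abuf, A)   # in a real solver, A would change each iteration
    dgesvd!(Abuf, S, U, VT, work, BlasInt(length(work)))
end
```

None of this is hard exactly, but you end up hand-maintaining the Fortran calling convention (hidden character-length arguments and all) for every routine you touch, which is the point where Fortran itself starts looking attractive.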