
M1 now has native numpy and scipy

    YUX · 2020-12-09 15:37:56 +08:00 · 7655 views

    https://github.com/conda-forge/miniforge

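    First download the matching Miniforge3 installer: OS X arm64 (Apple Silicon).

    Once it is installed you have conda, and numpy, scipy, and so on installed through conda are all native arm64 builds.

    The performance gain is large, whether compared with Rosetta 2 or with an Intel i9.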
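
One quick way to confirm that the interpreter from the arm64 Miniforge install is running natively rather than under Rosetta 2 is to ask Python which architecture it reports. A minimal sketch using only the standard library; `describe_interpreter` is a hypothetical helper name, not part of conda or numpy:

```python
import platform
import sys


def describe_interpreter():
    """Return basic facts about the running Python interpreter."""
    return {
        # On Apple Silicon a native build reports 'arm64';
        # an x86_64 build running under Rosetta 2 reports 'x86_64'.
        "machine": platform.machine(),
        "python": platform.python_version(),
        "executable": sys.executable,
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `executable` points inside the Miniforge directory and `machine` is `arm64`, packages installed into that environment should be the native builds.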

    Addendum 1  ·  2020-12-09 20:17:37 +08:00
    Everyone, please share your own benchmarks 😂
    42 replies    2021-04-23 04:02:49 +08:00
    pb941129
        1
    pb941129  
       2020-12-09 15:39:45 +08:00 via iPhone
    I'd like to know how much faster it is than the MKL build of numpy on an Intel i9...
    NoobX
        2
    NoobX  
       2020-12-09 16:42:16 +08:00 via iPhone
    But it tops out at 16 GB...
    Goldilocks
        3
    Goldilocks  
       2020-12-09 16:45:04 +08:00 via Android
    Looking forward to benchmarks; I expect it gets crushed by AVX-512.
    felixcode
        4
    felixcode  
       2020-12-09 19:43:51 +08:00 via Android
    My VRAM is bigger than your RAM.
    YUX
        5
    YUX  
    OP
       2020-12-09 19:49:07 +08:00
    @pb941129
    @NoobX
    @Goldilocks
    @felixcode



    I found a numpy benchmark script and ran it: https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276

    ```
    Dotted two 4096x4096 matrices in 0.53 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.59 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.74 s.

    This was obtained using the following Numpy configuration:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    ```




    P.S. Python 3.9.1 (arm64); all background apps were closed during the run.
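
The five timed operations in the output above can be approximated with a short script. This is a rough sketch, not the linked gist: `time_call` and `run_benchmark` are hypothetical names, the default size is much smaller than the thread's 4096x4096 so that it finishes quickly, and it assumes numpy is installed.

```python
import time

import numpy as np


def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmark(n=256, seed=0):
    """Time a matrix product, a vector dot product, SVD, Cholesky and eig."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    tall = rng.random((n, n // 2))
    # With n=4096 these vectors have length 524288, as in the thread.
    v = rng.random(128 * n)
    w = rng.random(128 * n)
    # Cholesky needs a symmetric positive definite input.
    spd = a @ a.T + n * np.eye(n)
    return {
        "matmul": time_call(np.dot, a, b),
        "vector_dot": time_call(np.dot, v, w),
        "svd": time_call(np.linalg.svd, tall, full_matrices=False),
        "cholesky": time_call(np.linalg.cholesky, spd),
        "eig": time_call(np.linalg.eig, a),
    }


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

Taking the best of several repeats reduces noise from background activity, which matters for the sub-millisecond vector case.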
    pb941129
        6
    pb941129  
       2020-12-09 19:58:15 +08:00   ❤️ 1
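    @YUX Thx. These are the results from my 16-inch MBP with the i9. Background apps were not closed. Environment: Anaconda 3.8. It still looks a bit faster than the M1. (Otherwise Intel really would be in tears.)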

    ```
    Dotted two 4096x4096 matrices in 0.45 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 3.53 s.

    This was obtained using the following Numpy configuration:
    blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']

    ```
    changepc90
        7
    changepc90  
       2020-12-09 20:12:20 +08:00
    M1: Dotted two vectors of length 524288 in 0.25 ms.
    MBP16: Dotted two vectors of length 524288 in 0.05 ms.
    That's a big gap on this one.
    YUX
        8
    YUX  
    OP
       2020-12-09 20:13:27 +08:00
    @pb941129 Nice, the i9 is still stronger 😂 Were all 8 cores / 16 threads maxed out during the run?
    YUX
        9
    YUX  
    OP
       2020-12-09 20:15:42 +08:00
    @changepc90 That is probably down to instruction set differences.
    Aspector
        10
    Aspector  
       2020-12-09 20:19:41 +08:00   ❤️ 1
    i7-8550U in a T480s; the library is mkl_rt

    Dotted two 4096x4096 matrices in 1.07 s.
    Dotted two vectors of length 524288 in 0.13 ms.
    SVD of a 2048x1024 matrix in 0.53 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
    Eigendecomposition of a 2048x2048 matrix in 5.07 s.

    HWMonitor shows the 8550U's real-time power draw at roughly 40-45 W, while the M1 is probably only around 20 W (sad).
    YUX
        11
    YUX  
    OP
       2020-12-09 20:21:59 +08:00
    Sharing results from a friend's 16-inch with a 2.6 GHz 6-Core Intel Core i7

    Dotted two 4096x4096 matrices in 0.49 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
    Eigendecomposition of a 2048x2048 matrix in 3.16 s.
    YUX
        12
    YUX  
    OP
       2020-12-09 20:24:36 +08:00
    @Aspector The M1 in the Air is limited to 10 W 😂
    pb941129
        13
    pb941129  
       2020-12-09 20:25:33 +08:00 via iPhone
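    @YUX I didn't check the task monitor, but from what I know of how numpy behaves, probably not. We could wait until lightgbm is ported and then run the CPU version together (back then, searching for the best parameters for a small project kept an entire 8700K maxed out for three hours).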
    rock_cloud
        14
    rock_cloud  
       2020-12-09 20:25:53 +08:00   ❤️ 1
    2017 iMac 3.4Ghz Intel i5
    Dotted two 4096x4096 matrices in 1.04 s.
    Dotted two vectors of length 524288 in 0.17 ms.
    SVD of a 2048x1024 matrix in 0.58 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
    Eigendecomposition of a 2048x2048 matrix in 5.37 s.
    No background apps were closed.
    YUX
        15
    YUX  
    OP
       2020-12-09 20:26:54 +08:00
    @pb941129 Three hours of stress testing? Can I run it inside the fridge 😂 There's no fan, so I'm afraid it would get scorched.
    sxd96
        16
    sxd96  
       2020-12-09 20:31:25 +08:00   ❤️ 1
    2018 13-inch MBP, i5-8259U

    Dotted two 4096x4096 matrices in 0.80 s.
    Dotted two vectors of length 524288 in 0.11 ms.
    SVD of a 2048x1024 matrix in 0.35 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 3.39 s.
    sxd96
        17
    sxd96  
       2020-12-09 20:35:06 +08:00
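    @sxd96 I feel a little better now. This was also without closing background apps, using the MKL library. But I noticed that with the cores fully loaded the MBP gives off a slight coil whine. ARM may be slightly behind here for now, but in terms of performance per watt it may not be behind at all. I think performance per watt is what matters for mobile devices.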
    Gandum
        18
    Gandum  
       2020-12-09 20:35:15 +08:00 via iPhone
    It's still an early version. It's winter now, though, so there's no rush; the fan isn't too noisy. I'll buy next summer.
    IgniteWhite
        19
    IgniteWhite  
       2020-12-09 20:35:29 +08:00 via iPhone   ❤️ 1
    Haha, I posted about this five months ago: /t/688402
    rock_cloud
        20
    rock_cloud  
       2020-12-09 20:36:02 +08:00   ❤️ 1
    Intel Xeon Silver 4114 2.2Ghz
    Dotted two 4096x4096 matrices in 0.60 s.
    Dotted two vectors of length 524288 in 0.04 ms.
    SVD of a 2048x1024 matrix in 0.66 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.26 s.
    Eigendecomposition of a 2048x2048 matrix in 6.67 s.
    YUX
        21
    YUX  
    OP
       2020-12-09 20:38:09 +08:00   ❤️ 1
    @IgniteWhite Way ahead of your time 😂 It really is a good thing.
    Tilie
        22
    Tilie  
       2020-12-09 20:54:48 +08:00   ❤️ 1
    8th-gen i7 Mac mini
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.09 ms.
    SVD of a 2048x1024 matrix in 0.56 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 5.20 s.
    YUX
        23
    YUX  
    OP
       2020-12-09 21:03:39 +08:00
    Google Colab - 2 Intel(R) Xeon(R) CPU @ 2.20GHz

    Dotted two 4096x4096 matrices in 4.16 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 1.49 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
    Eigendecomposition of a 2048x2048 matrix in 13.11 s.
    zr86
        24
    zr86  
       2020-12-09 21:14:01 +08:00
    M1 Mac mini

    Dotted two 4096x4096 matrices in 0.69 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.68 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.82 s.
    wydinhk
        25
    wydinhk  
       2020-12-09 22:21:48 +08:00
    M1 MacBook Pro

    Dotted two 4096x4096 matrices in 0.68 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.71 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 5.03 s.

    Measured power with powermetrics at the same time: about 26 W for the first two tests and about 16 W for the last three.
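
Combining these power readings with the run times gives a rough energy-per-task comparison, which is what the performance-per-watt remarks earlier in the thread are getting at. A back-of-the-envelope sketch using only numbers posted in this thread (this post and the i7-8550U reply); it assumes power was constant during each test and that powermetrics and HWMonitor measure comparable things, which may not hold:

```python
def energy_joules(watts, seconds):
    """Energy = average power x elapsed time."""
    return watts * seconds


# (average watts, seconds for the 4096x4096 matrix product), as posted in the thread.
runs = {
    "M1 MacBook Pro": (26.0, 0.68),
    "i7-8550U, low estimate": (40.0, 1.07),
    "i7-8550U, high estimate": (45.0, 1.07),
}

energy = {name: energy_joules(w, s) for name, (w, s) in runs.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.1f} J")
```

On these figures the M1 spends roughly 18 J on the matrix product against roughly 43 to 48 J for the i7-8550U, although the measurement caveats above make this indicative at best.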
    lovestudykid
        26
    lovestudykid  
       2020-12-10 03:17:17 +08:00
    This test doesn't separate machines very much.
    MF839: only about twice as slow as the OP's M1
    Dotted two 4096x4096 matrices in 2.33 s.
    Dotted two vectors of length 524288 in 0.54 ms.
    SVD of a 2048x1024 matrix in 1.05 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
    Eigendecomposition of a 2048x2048 matrix in 8.38 s.


    Intel(R) Xeon(R) Gold 6134
    Dotted two 4096x4096 matrices in 0.32 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.89 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
    Eigendecomposition of a 2048x2048 matrix in 8.19 s.
    The numpy installed by default with Anaconda here does not use MKL and does not have AVX-512 enabled, so this CPU is going to waste.
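
Which BLAS/LAPACK backend a given numpy build links against can be checked from Python; the configuration dumps pasted in this thread are in the format `numpy.show_config()` printed at the time. A small sketch that captures that report as text; `numpy_build_report` is a hypothetical helper name, and the exact layout of the report varies between numpy versions:

```python
import contextlib
import io

import numpy as np


def numpy_build_report():
    """Capture the text that numpy.show_config() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


report = numpy_build_report()
print(report)
```

Entries naming `mkl_rt` indicate MKL; entries naming Accelerate or only generic `cblas`/`blas` libraries, as in other dumps in this thread, indicate a different backend.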
    pubby
        27
    pubby  
       2020-12-10 10:01:09 +08:00
    3700X Hackintosh

    Dotted two 4096x4096 matrices in 0.46 s.
    Dotted two vectors of length 524288 in 0.08 ms.
    SVD of a 2048x1024 matrix in 7.37 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.82 s.
    Eigendecomposition of a 2048x2048 matrix in 49.05 s.

    This was obtained using the following Numpy configuration:
    atlas_threads_info:
    NOT AVAILABLE
    blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.Internal.sdk/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_blas_threads_info:
    NOT AVAILABLE
    openblas_info:
    NOT AVAILABLE
    lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_info:
    NOT AVAILABLE
    lapack_mkl_info:
    NOT AVAILABLE
    blas_mkl_info:
    NOT AVAILABLE
    atlas_blas_info:
    NOT AVAILABLE
    mkl_info:
    NOT AVAILABLE


    I'm probably not using it the right way....
    bnuliujing
        28
    bnuliujing  
       2020-12-10 10:18:09 +08:00
    i7-6950X results

    Dotted two 4096x4096 matrices in 0.35 s.
    Dotted two vectors of length 524288 in 0.03 ms.
    SVD of a 2048x1024 matrix in 0.27 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
    Eigendecomposition of a 2048x2048 matrix in 3.39 s.
    NoobX
        29
    NoobX  
       2020-12-10 11:05:02 +08:00
    Results from the i5 Mac mini

    Dotted two 4096x4096 matrices in 0.58 s.
    Dotted two vectors of length 524288 in 0.08 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 3.30 s.

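    The M1 results aren't that impressive either...
    But the 16 GB of RAM is still a big problem: the system usually eats 4 GB by itself, leaving only 12 GB of the 16 GB for the dataset, which honestly is not enough for me.
    A somewhat slower processor isn't a big deal, but once swap fills up, the speed is a real nightmare.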
    MisakaTian
        30
    MisakaTian  
       2020-12-10 11:58:25 +08:00
    As a data grunt, I'll jump on it as soon as Anaconda is sorted out.
    Goldilocks
        31
    Goldilocks  
       2020-12-10 12:06:11 +08:00
    Processor Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz, 3600 Mhz, 4 Core

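    Dotted two 4096x4096 matrices in 0.33 s, twice as fast as the M1. But the M1 has 8 cores. So at the same frequency and the same core count, Intel is still about 3-4x faster than the M1, and this is a three-year-old product.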
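
The per-core comparison above can be reproduced with simple arithmetic. A sketch under the poster's implicit assumption that run time scales inversely with core count; it ignores clock speed and the fact that the M1's eight cores are four performance plus four efficiency cores, so the result is only indicative. The M1 times are the ones posted earlier in this thread:

```python
def core_seconds(seconds, cores):
    """Run time multiplied by core count, a crude per-core cost."""
    return seconds * cores


# 4096x4096 matrix product times from the thread.
xeon_w2123 = core_seconds(0.33, 4)
m1_times = {"OP's M1": 0.53, "M1 Mac mini": 0.69}

ratios = {
    name: core_seconds(seconds, 8) / xeon_w2123
    for name, seconds in m1_times.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the core-seconds of the Xeon W-2123")
```

This gives about 3.2x and 4.2x, which is where the "3-4x" figure comes from; treating all eight M1 cores as equal is the weakest part of the estimate.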
    YUX
        32
    YUX  
    OP
       2020-12-10 12:12:50 +08:00 via iPhone
    @MisakaTian Use mamba.
    Goldilocks
        33
    Goldilocks  
       2020-12-10 12:18:45 +08:00
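    It's 2020. If Intel released a 2-core 3.6 GHz CPU, you certainly wouldn't think much of its performance. What you should be thinking about is Intel with 10 or 20 cores. AMD is about to release a 64-core desktop CPU, and Apple is still stuck at the 2-core level.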
    meloyang05
        34
    meloyang05  
       2020-12-10 13:35:48 +08:00
    @Goldilocks

    “8th-gen i7 Mac mini
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.09 ms.
    SVD of a 2048x1024 matrix in 0.56 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 5.20 s.

    M1 Mac mini

    Dotted two 4096x4096 matrices in 0.69 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.68 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.82 s.”

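    Are you selectively ignoring the other test results? Timings at the millisecond level can have large errors to begin with, and numpy for M1 may currently have a bug. What does singling out the vector result prove?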
    Goldilocks
        35
    Goldilocks  
       2020-12-10 13:38:09 +08:00
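    The error won't be large; it is generally within 1%. That is because matrix multiplication is bound by only two things:

    1. CPU FLOPS
    2. Memory bandwidth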
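
For the vector dot product debated above, memory traffic is the more relevant of the two limits: each element is read once and used in a single multiply-add. A rough estimate of the implied data rate, assuming 8-byte float64 elements (numpy's default; the thread does not state the dtype) and that both vectors are streamed in full:

```python
def dot_data_rate_gb_s(length, seconds, bytes_per_element=8):
    """Data rate in GB/s implied by reading two vectors of `length` elements."""
    bytes_read = 2 * length * bytes_per_element
    return bytes_read / seconds / 1e9


# Times for the length-524288 dot product, as posted in the thread.
rates = {
    "M1 (0.25 ms)": dot_data_rate_gb_s(524288, 0.25e-3),
    "16-inch MBP i9 (0.05 ms)": dot_data_rate_gb_s(524288, 0.05e-3),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GB/s")
```

That works out to about 34 GB/s and 168 GB/s. The two vectors total 8 MiB, so repeated runs may be served partly from cache rather than main memory, which is one reason sub-millisecond timings like these are hard to compare across machines.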
    Goldilocks
        36
    Goldilocks  
       2020-12-10 13:45:33 +08:00
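    Numerical computation such as matrix multiplication is a very mature field that has been studied thoroughly. See: https://en.wikichip.org/wiki/flops

    Assuming memory bandwidth can keep up with the CPU, the only ways to run faster are:
    1. More cores
    2. Wider SIMD

    For example, Skylake can do 64 FLOPs/cycle, while AMD CPUs of the same era only do 16 FLOPs/cycle. Clock speeds are all about the same, so this 4x factor accounts for most of the gap. And this kind of gap is very hard to close; you could say there is no hope of it in a lifetime.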
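
The argument above amounts to the usual peak-throughput formula: cores x clock x FLOPs per cycle. A small sketch applying it to the 4-core 3.6 GHz Xeon W-2123 from this thread, using the 64 and 16 FLOPs/cycle figures as quoted in the post (whether they refer to single or double precision is not stated there), and comparing with the rate implied by the posted matrix-product time, taking a dense n x n product as about 2n^3 floating-point operations:

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s at the nominal clock."""
    return cores * ghz * flops_per_cycle


def matmul_gflops(n, seconds):
    """Achieved GFLOP/s for an n x n by n x n product (about 2*n**3 operations)."""
    return 2 * n**3 / seconds / 1e9


peak_64 = peak_gflops(4, 3.6, 64)      # 64 FLOPs/cycle figure from the post
peak_16 = peak_gflops(4, 3.6, 16)      # 16 FLOPs/cycle figure from the post
achieved = matmul_gflops(4096, 0.33)   # 0.33 s as posted for the W-2123

print(f"peak at 64 FLOPs/cycle: {peak_64:.1f} GFLOP/s")
print(f"peak at 16 FLOPs/cycle: {peak_16:.1f} GFLOP/s")
print(f"achieved:               {achieved:.1f} GFLOP/s")
```

The achieved rate of roughly 416 GFLOP/s falls between the two peaks. Sustained clocks under wide-SIMD load are often below nominal, so the real ceiling is likely lower than the formula suggests.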
    Harry1993
        37
    Harry1993  
       2020-12-10 14:08:58 +08:00
    Tried it with Apple's numpy (https://github.com/apple/tensorflow_macos):

    Dotted two 4096x4096 matrices in 0.84 s.
    Dotted two vectors of length 524288 in 0.11 ms.
    SVD of a 2048x1024 matrix in 0.54 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.06 s.
    Eigendecomposition of a 2048x2048 matrix in 6.29 s.
    IgniteWhite
        38
    IgniteWhite  
       2020-12-10 23:07:30 +08:00
    @MisakaTian Isn't miniforge's package manager just conda... only the default channel is conda-forge.
    lly0514
        39
    lly0514  
       2020-12-11 15:35:01 +08:00
    @Goldilocks In practice the error is very large; in my own tests the performance gap between MKL and OpenBLAS was more than a factor of two.
    Richardyyz
        40
    Richardyyz  
       2020-12-13 09:58:14 +08:00
    @Goldilocks Zen 2 already does 32 FLOPs/cycle. Is your lifetime really that short? AVX-512, with its heavy downclocking, does not have much of an advantage over Zen 3.
    YUX
        41
    YUX  
    OP
       2021-01-24 20:05:33 +08:00
    Adding one from a Raspberry Pi 😂

    Dotted two 4096x4096 matrices in 10.18 s.
    Dotted two vectors of length 524288 in 2.27 ms.
    SVD of a 2048x1024 matrix in 6.67 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.85 s.
    Eigendecomposition of a 2048x2048 matrix in 37.83 s.

    This was obtained using the following Numpy configuration:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/mambaforge/envs/maths/include']
    YRInc
        42
    YRInc  
       2021-04-23 04:02:49 +08:00
    Here's a Chinese-made one for reference: Kunpeng 920

    12-core Kunpeng 920, 24 GB RAM:
    -------------------
    Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:45:16)

    Dotted two 4096x4096 matrices in 1.48 s.
    Dotted two vectors of length 524288 in 0.49 ms.
    SVD of a 2048x1024 matrix in 1.10 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
    Eigendecomposition of a 2048x2048 matrix in 8.36 s.
    -------------------


    24-core Kunpeng 920, 48 GB RAM:
    -------------------
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.48 ms.
    SVD of a 2048x1024 matrix in 0.93 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
    Eigendecomposition of a 2048x2048 matrix in 7.66 s.


    Same environment as used on the M1 Mac, Miniforge3; the relevant acceleration libraries are:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/miniforge3/include']