Chips and Cheese
Inside Qualcomm’s Adreno 530, a Small Mobile iGPU
#ChipAndCheese
Telegraph | source
(author: clamchowder)
#WhatIReadToday
https://mp.weixin.qq.com/s?__biz=MzkxMTIyODMwOQ==&mid=2247483873&idx=1&sn=cdf69ba59f3620b6007a8741a89d5155&chksm=c11e2d1bf669a40d5e2303b93cf876ee0e2b260457d7e593963820013a4a9a3767ddb24ce371&mpshare=1&scene=23&srcid=0115DoKHEINFH4gVJyq0oKTh&sharer_shareinfo=50501534f60f60752a078d186ea5da27&sharer_shareinfo_first=448a9c8794f02598e4f161bc66ea4889#rd
Daniel Lemire's blog
Implementing the missing sign instruction in AVX-512
Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.
With SSSE3, Intel introduced sign instructions, with the corresponding intrinsic functions (e.g., _mm_sign_epi8). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers in AVX2.
These instructions apply the sign of one parameter to the other parameter. This is most easily explained with pseudocode:
function sign(a, b): # a and b are integers
if b == 0 : return 0
if b < 0 : return -a
if b > 0 : return a
The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can process sixteen signed 8-bit integers at once.
You can view it as a generalization of the absolute value function: abs(a) = sign(a, a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., they are used in llama.cpp, the open source LLM project.
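As a rough illustration of the pseudocode above, here is a scalar C model of a single 8-bit lane, plus a loop that applies it across an array the way the SIMD instruction does across a register. The names sign_i8, sign_i8_array and sign_i8_self_check are made up for this sketch and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 8-bit lane: apply the sign of b to a. */
static int8_t sign_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    /* Negate through unsigned arithmetic so that -128 wraps back to -128
       on the usual two's-complement targets instead of overflowing. */
    if (b < 0) return (int8_t)(0u - (uint8_t)a);
    return a;
}

/* Apply the lane model to n elements, one after the other. */
static void sign_i8_array(const int8_t *a, const int8_t *b,
                          int8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = sign_i8(a[i], b[i]);
}

/* Check abs(a) == sign(a, a) for every value where abs(a) fits in 8 bits. */
static void sign_i8_self_check(void) {
    for (int v = -127; v <= 127; v++) {
        int expected = v < 0 ? -v : v;
        assert(sign_i8((int8_t)v, (int8_t)v) == expected);
    }
}
```

Passing the same value as both arguments gives the absolute value, which is the identity the self-check walks through.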
When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function _mm256_sign_epi8, we don’t have _mm512_sign_epi8. The same instructions are missing for 16-bit and 32-bit integers (e.g., no _mm512_sign_epi16 is found).
You may implement it for AVX-512 with several instructions. I found this approach:
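#include <x86intrin.h>
// Emulates the missing 512-bit byte sign operation (needs AVX-512BW).
__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // blt0: lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // ble0: lanes where b <= 0
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  // a where b < 0, zero elsewhere
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  // where b <= 0: 0 - a_blt0 (that is, -a or 0); elsewhere: a
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}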
It is disappointingly expensive. It might compile to four or five instructions:
vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1
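To see why two masks are enough to cover the three cases, the same sequence can be emulated in portable scalar C. This is only a sketch for reasoning about the logic, not a replacement for the intrinsics; the helper names (mask_lt0, mask_le0, sign_epi8_emulated) are hypothetical, and a 64-bit integer stands in for the mask register.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 64 /* a 512-bit register holds 64 8-bit lanes */

/* Bit i is set when b[i] < 0, mirroring the sign-bit extraction step. */
static uint64_t mask_lt0(const int8_t b[LANES]) {
    uint64_t m = 0;
    for (int i = 0; i < LANES; i++)
        if (b[i] < 0) m |= (uint64_t)1 << i;
    return m;
}

/* Bit i is set when b[i] <= 0, mirroring the compare-with-zero step. */
static uint64_t mask_le0(const int8_t b[LANES]) {
    uint64_t m = 0;
    for (int i = 0; i < LANES; i++)
        if (b[i] <= 0) m |= (uint64_t)1 << i;
    return m;
}

static void sign_epi8_emulated(const int8_t a[LANES], const int8_t b[LANES],
                               int8_t out[LANES]) {
    uint64_t blt0 = mask_lt0(b);
    uint64_t ble0 = mask_le0(b);
    for (int i = 0; i < LANES; i++) {
        /* Masked move: keep a where b < 0, zero elsewhere. */
        uint8_t a_blt0 = ((blt0 >> i) & 1) ? (uint8_t)a[i] : 0;
        /* Masked subtract: 0 - a_blt0 where b <= 0, pass a through elsewhere.
           For b == 0 the subtrahend is already zero, so the result is 0. */
        out[i] = ((ble0 >> i) & 1) ? (int8_t)(uint8_t)(0u - a_blt0) : a[i];
    }
}
```

The b == 0 lanes fall out for free: they are in the "less than or equal" mask but not in the "less than" mask, so they compute 0 - 0.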
In practice, you may not need to pay such a high price. The problem is difficult because we have three cases to handle (three signs: b = 0, b > 0, b < 0). If you do not care about the case ‘b = 0’, then you can do it in two instructions:
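#include <x86intrin.h>
__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // blt0: lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // where b < 0: 0 - a; elsewhere (including b == 0): a
  return _mm512_mask_sub_epi8(a, blt0, zero, a);
}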
E.g., we implemented…
function sign_cheated(a, b): # a and b are integers
if b < 0 : return -a
if b ≥ 0 : return a
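To get a feel for what the shortcut gives up, the following scalar sketch compares the exact three-case sign with a two-case version over all 8-bit inputs. It assumes the behaviour of the intrinsic version shown earlier, which tests only the sign bit of b, so b = 0 is treated like a positive value. The function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Exact three-case semantics. */
static int8_t sign_exact(int8_t a, int8_t b) {
    if (b == 0) return 0;
    if (b < 0) return (int8_t)(0u - (uint8_t)a);
    return a;
}

/* Two-case shortcut: only the sign bit of b is consulted,
   so b == 0 behaves like b > 0 and returns a unchanged. */
static int8_t sign_cheated(int8_t a, int8_t b) {
    return (b < 0) ? (int8_t)(0u - (uint8_t)a) : a;
}

/* Count the (a, b) pairs on which the two versions disagree and
   assert that every disagreement has b == 0. */
static int count_mismatches(void) {
    int mismatches = 0;
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            if (sign_exact((int8_t)a, (int8_t)b) !=
                sign_cheated((int8_t)a, (int8_t)b)) {
                assert(b == 0);
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

The two versions can only disagree when b is zero and a is not, so the shortcut is safe whenever the second operand is known to be nonzero.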
source
The memory remains: Permanent memory with systemd and a Rust allocator https://darkcoding.net/software/rust-systemd-memory-remains/
Arch Linux: Recent news updates
Making dbus-broker our default D-Bus daemon
We are making dbus-broker our default implementation of D-Bus, for improved performance, reliability and integration with systemd.
For the foreseeable future we will still support the use of dbus-daemon, the previous implementation. Pacman will ask you whether to install dbus-broker-units or dbus-daemon-units. We recommend picking the default.
For a more detailed rationale, please see our RFC 25.
source
(author: Jan Alexander Steffens)
totally_safe_transmute, Line-by-Line (2021) https://blog.yossarian.net/2021/03/16/totally_safe_transmute-line-by-line
Chips and Cheese
Maxwell: Nvidia’s Silver 28nm Hammer
#ChipAndCheese
Telegraph | source
(author: clamchowder)
WPA3 Enterprise 192-bit mode at home https://smallstep.com/blog/home-network-eap-tls-wifi/
《1个时钟周期》
Harry Chen’s Blog
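Configuring Configless Slurm on Debian
Slurm added a configless mode in version 20.02, so the full set of configuration files no longer has to be maintained on every node that runs slurmd. This is good news for anyone operating an HPC cluster. Previously, N copies of the configuration had to be kept identical at all times, otherwise obscure, hard-to-diagnose problems tended to appear, and consistency is a perennial hard problem in computer science. Now a single copy is maintained on the control node that runs slurmctld. slurmd on the other nodes automatically pulls the latest configuration at startup, and a runtime reconfig is no longer affected by local configuration.
According to the Slurm documentation, configless mode has the following requirements:
1. slurmd must be able to find slurmctld, either through a DNS SRV record or by passing the --conf-server option at startup;
2. If an SRV record is used, no configuration may exist locally when slurmd starts (because the SRV record has the lowest priority in the search order).
Because our cluster runs more than one Slurm instance, different slurmd daemons have to be pointed at different slurmctld instances, so for simplicity I chose the command-line option. The changes below use Debian's slurm-wlm package as an example.
Edit /etc/default/slurmd and add the --conf-server option:
SLURMD_OPTIONS="--conf-server your_ctl_server:6817"
According to the documentation this is already enough, but to be safe you can also use systemd to hide the configuration in /etc/slurm from slurmd (rather than actually deleting it), which avoids potential conflicts or confusion. Run systemctl edit slurmd and enter:
[Unit]
ConditionPathExists=
[Service]
TemporaryFileSystem=/etc/slurm
The service unit shipped by Debian checks for /etc/slurm/slurm.conf as a start condition, so the empty assignment in the [Unit] section disables that check, and the [Service] section mounts a temporary file system to hide the original directory. After restarting the service, you can inspect the contents of /proc/$(pgrep slurmd)/root/etc/slurm to check that it works.
I switched all of our lab's clusters to configless mode this way, and so far everything works. The only problem I ran into was that the GRES configuration sometimes could not be updated through reconfig. Removing the configuration, running reconfig, adding the configuration back and running reconfig again resolved it.
source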
(author: Shengqi Chen ([email protected]))
Chips and Cheese
A New Year and New Tests: GPU L1 Cache Bandwidth
#ChipAndCheese
Telegraph | source
(author: clamchowder)
Happy New Year 2024!