Easton Man's Channel

Daniel Lemire's blog
Implementing the missing sign instruction in AVX-512

Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.

With SSSE3, Intel introduced sign instructions, with the corresponding intrinsic functions (e.g., _mm_sign_epi8). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers in AVX2.

What these instructions do is apply the sign of one parameter to the other parameter. It is most easily explained as pseudocode:

function sign(a, b): # a and b are integers
   if b == 0 : return 0
   if b < 0 : return -a
   if b > 0 : return a

The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can generate sixteen signed 8-bit integers at once.
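As a rough scalar model of the per-lane behaviour described above, the sketch below applies the pseudocode to arrays of 8-bit integers. The names sign_i8 and sign_i8_lanes are hypothetical, chosen for illustration only, and the code assumes that negation wraps around in 8 bits (so negating -128 gives -128), which is how two's complement lanes behave.

```c
#include <stddef.h>
#include <stdint.h>

/* One 8-bit lane: apply the sign of b to a (hypothetical helper). */
int8_t sign_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    if (b < 0) {
        /* Negate with 8-bit wraparound, so -(-128) stays -128. */
        return (int8_t)(uint8_t)(0u - (uint8_t)a);
    }
    return a;
}

/* Apply sign_i8 to n lanes; psignb handles sixteen such lanes at once. */
void sign_i8_lanes(const int8_t *a, const int8_t *b, int8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = sign_i8(a[i], b[i]);
    }
}
```

For instance, with a = {5, 5, 5} and b = {-1, 0, 7}, sign_i8_lanes writes {-5, 0, 5} to out.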

You can view it as a generalization of the absolute value function: abs(a) = sign(a,a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., they are used in llama.cpp, the open-source LLM project.
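To see why the absolute value is a special case, pass the same value as both parameters: a negative a is negated, a positive a is kept, and zero stays zero. Here is a minimal scalar sketch; sign_one and abs_via_sign are hypothetical names, and 8-bit wraparound is assumed, so -128 maps to itself.

```c
#include <stdint.h>

/* Scalar sign for one 8-bit value, following the pseudocode. */
int8_t sign_one(int8_t a, int8_t b) {
    if (b == 0) return 0;
    if (b < 0) return (int8_t)(uint8_t)(0u - (uint8_t)a); /* wrapping negate */
    return a;
}

/* Absolute value expressed as sign(a, a). */
int8_t abs_via_sign(int8_t a) {
    return sign_one(a, a);
}
```

Here abs_via_sign(-7) gives 7, while abs_via_sign(0) and abs_via_sign(9) return their input unchanged.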

When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function _mm256_sign_epi8, we don’t have _mm512_sign_epi8. The same instructions are missing for 16-bit and 32-bit integers (e.g., no _mm512_sign_epi16 is found).

You may implement it for AVX-512 with several instructions. I found this approach:

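#include <x86intrin.h>

__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // mask of lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // mask of lanes where b <= 0
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  // a where b < 0, zero elsewhere
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  // where b <= 0: 0 - a_blt0 (that is, -a or 0); elsewhere: a
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}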

It is disappointingly expensive. It might compile to four or five instructions:

vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1
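One way to convince yourself that the masked sequence matches the pseudocode is to emulate a single 8-bit lane in plain C and compare it against the reference for every input pair. The sketch below does this without AVX-512 hardware. All function names are hypothetical, and each mask is modelled as a single boolean per lane, which is an assumption about how the mask intrinsics behave lane by lane.

```c
#include <stdint.h>

/* Wrapping 8-bit subtraction, modelling one lane of a byte subtract. */
int8_t wrap_sub8(int8_t x, int8_t y) {
    return (int8_t)(uint8_t)((uint8_t)x - (uint8_t)y);
}

/* Reference semantics from the pseudocode, for one lane. */
int8_t ref_sign_lane(int8_t a, int8_t b) {
    if (b == 0) return 0;
    if (b < 0) return wrap_sub8(0, a);
    return a;
}

/* One lane of the masked sequence: two masks, a masked move, a masked subtract. */
int8_t masked_sign_lane(int8_t a, int8_t b) {
    int blt0 = (b < 0);                  /* mask bit taken from the sign bit */
    int ble0 = (b <= 0);                 /* mask bit from the <= 0 compare */
    int8_t a_blt0 = blt0 ? a : 0;        /* a where b < 0, zero elsewhere */
    return ble0 ? wrap_sub8(0, a_blt0)   /* 0 - a_blt0 where b <= 0 */
                : a;                     /* a elsewhere */
}

/* Count the (a, b) pairs on which the two versions disagree. */
int count_masked_mismatches(void) {
    int bad = 0;
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            if (masked_sign_lane((int8_t)a, (int8_t)b) !=
                ref_sign_lane((int8_t)a, (int8_t)b)) {
                bad++;
            }
        }
    }
    return bad;
}
```

Over all 65,536 input pairs, count_masked_mismatches() returns 0 for this model.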

In practice, you may not need to pay such a high price. The reason the problem is difficult is that we have three cases to handle (three signs: b=0, b>0, b<0). If you do not care about the case ‘b = 0’, then you can do it in two instructions:

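#include <x86intrin.h>

__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // mask of lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // where b < 0: 0 - a; elsewhere (including b == 0): a
  return _mm512_mask_sub_epi8(a, blt0, zero, a);
}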

E.g., we implemented…

function sign_cheated(a, b): # a and b are integers
   if b < 0 : return -a
   if b ≥ 0 : return a
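To get a feel for what the shortcut gives up, here is a scalar sketch comparing the full sign function with a model of the two-instruction C version, in which only the sign bit of b decides whether a is negated. The function names are hypothetical. Under this model, the two versions agree whenever b is nonzero and differ only when b is 0 and a is nonzero.

```c
#include <stdint.h>

/* Wrapping 8-bit negation. */
int8_t neg8(int8_t a) {
    return (int8_t)(uint8_t)(0u - (uint8_t)a);
}

/* Full three-case sign, as in the original pseudocode. */
int8_t full_sign(int8_t a, int8_t b) {
    if (b == 0) return 0;
    return (b < 0) ? neg8(a) : a;
}

/* Shortcut: negate only where the sign bit of b is set. */
int8_t cheated_sign(int8_t a, int8_t b) {
    return (b < 0) ? neg8(a) : a;
}

/* For a fixed b, count the values of a on which the two versions differ. */
int count_diffs_for_b(int8_t b) {
    int diffs = 0;
    for (int a = -128; a <= 127; a++) {
        if (full_sign((int8_t)a, b) != cheated_sign((int8_t)a, b)) {
            diffs++;
        }
    }
    return diffs;
}
```

With b = 0, the versions differ for the 255 nonzero values of a; for any nonzero b, count_diffs_for_b returns 0.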

source

10:36 · Jan 11, 2024 · Thu

The memory remains: Permanent memory with systemd and a Rust allocator https://darkcoding.net/software/rust-systemd-memory-remains/

Graham King

A Rust object that survives program restart thanks to Rust allocators, systemd's file descriptor store, and syscall memfd_create.

01:13 · Jan 10, 2024 · Wed

Arch Linux: Recent news updates
Making dbus-broker our default D-Bus daemon

We are making dbus-broker our default implementation of D-Bus, for improved performance, reliability and integration with systemd.

For the foreseeable future we will still support the use of dbus-daemon, the previous implementation. Pacman will ask you whether to install dbus-broker-units or dbus-daemon-units. We recommend picking the default.

For a more detailed rationale, please see our RFC 25.

source
(author: Jan Alexander Steffens)

12:57 · Jan 9, 2024 · Tue

totally_safe_transmute, Line-by-Line (2021) https://blog.yossarian.net/2021/03/16/totally_safe_transmute-line-by-line

04:11 · Jan 9, 2024 · Tue

Chips and Cheese
Maxwell: Nvidia’s Silver 28nm Hammer
#ChipAndCheese

Telegraph | source
(author: clamchowder)

Nvidia’s Kepler architecture gave the company a strong start in the 28nm era. Consumer Kepler parts provided highly competitive gaming performance and power efficiency. In the compute market, Kepler had no serious competition thanks to the strong CUDA software…

21:17 · Jan 5, 2024 · Fri

WPA3 Enterprise 192-bit mode at home https://smallstep.com/blog/home-network-eap-tls-wifi/

Smallstep

You shouldn’t run NSA-grade Wi-Fi at home. Here’s how to do it

For our 2023 holiday project, we're setting up a WPA3 Enterprise certificate-authenticated Wi-Fi network at home! And when your family from out of town asks to "jump on the Wi-Fi real quick," you'll learn why this type of network is such a hassle to manage.

18:48 · Jan 3, 2024 · Wed

Easton Man's Channel

Hard to imagine what absurd questions will show up on tomorrow's exam

"1 clock cycle"

18:47 · Jan 3, 2024 · Wed

Hard to imagine what absurd questions will show up on tomorrow's exam

09:47 · Jan 3, 2024 · Wed

#今日看了什么 (What I read today)
https://github.com/dendibakh/perf-book/releases/download/Q4.2023/Performance.Analysis.and.Tuning.on.Modern.CPUs.Q4.2023.pdf
The new draft has been released

23:13 · Jan 2, 2024 · Tue

Harry Chen’s Blog
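Configuring Configless Slurm on Debian

Slurm added a configless feature in 20.02, which means you no longer need to maintain all the configuration files on every node that runs slurmd. This is certainly good news for anyone operating HPC clusters. Previously, N copies of the configuration files had to be kept identical at all times, or you would run into mysterious, hard-to-diagnose problems, and consistency is a perennial hard problem in computer science. Now you only need to maintain one copy of the configuration on the control node running slurmctld. slurmd on the other nodes automatically pulls the latest configuration at startup, and a runtime reconfig no longer risks being affected by local configuration.

According to the Slurm documentation, configless mode has the following requirements:

1. slurmd must be able to find slurmctld: this can be done through a DNS SRV record or by passing the --conf-server option at startup;
2. If you use an SRV record, slurmd must have no local configuration at startup (because the SRV record has the lowest priority in the search order).

Because our clusters run more than one Slurm installation, different slurmd instances need to point to different slurmctld instances, so for simplicity I chose the command-line option. Using Debian's slurm-wlm as an example, the changes are as follows:

Edit /etc/default/slurmd and add the --conf-server option: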

SLURMD_OPTIONS="--conf-server your_ctl_server:6817"

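Although the documentation says this is enough, to be safe you can also use systemd to hide the configuration in /etc/slurm from slurmd (rather than actually deleting it), which avoids potential conflicts or confusion. Run systemctl edit slurmd: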

[Unit]
ConditionPathExists=

[Service]
TemporaryFileSystem=/etc/slurm

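Because the service unit shipped by Debian checks for /etc/slurm/slurm.conf as a start condition, the [Unit] section disables that check by overriding it with an empty value, and the [Service] section then hides the original directory by mounting a temporary file system over it. After restarting the service, you can check that it works by inspecting the contents of /proc/$(pgrep slurmd)/root/etc/slurm.

I switched all of our lab's clusters to configless mode using the method above, and so far everything works. The only problem I ran into is that the GRES configuration sometimes fails to update through reconfig. Removing the configuration, running reconfig, adding the configuration back, and running reconfig again resolved it.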

source
(author: Shengqi Chen)

01:22 · Jan 2, 2024 · Tue

Chips and Cheese
A New Year and New Tests: GPU L1 Cache Bandwidth
#ChipAndCheese

Telegraph | source
(author: clamchowder)

In my past articles on GPUs, I didn’t have good measurements for L1 cache bandwidth. Microbenchmarking cache bandwidth is harder on GPUs than CPUs. That’s because programming GPUs in assembly code is impractical. GPU instruction sets change between manufacturers…

20:59 · Jan 1, 2024 · Mon

dramforever's blog
Threaded code explained in C

Telegraph | source

At some point in your life you may have decided that it would be a good idea to represent something in terms of a “virtual machine”. You know, a relatively simple format of data that encodes things to do, and a simple interpreter reading it and doing the actual…

00:00 · Jan 1, 2024 · Mon