Easton Man's Channel

News that @EastonMan reads
+ ramblings
+ admiring the experts
+ occasional cats
+ songs Easton listens to

14:46 · Jan 30, 2024 · Tue

08:17 · Jan 30, 2024 · Tue

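@chenyy, a member of a neighboring group, found that mmap in Linux behaves differently on the RISC-V instruction set than on other instruction sets, and proposed a patch. However, the maintainer has been very unreceptive to this patch, and the earlier patch of questionable quality that introduced the problem was itself merged hastily, without adequate review.
He believes that leaving this problem unresolved would be disastrous for the RISC-V ecosystem going forward; for example, it would affect the use of memory-hungry software on RISC-V platforms. Because his channel is private, see the telegraph below for more details on the issue:
https://telegra.ph/Linux-%E4%B8%AD-RISC-V-%E7%9A%84-mmap-%E7%9A%84%E9%87%8D%E5%A4%A7%E9%97%AE%E9%A2%98-01-29
The two of them have argued themselves to exhaustion, so he now hopes that more developers in this area will join the discussion. Whether you support the current behavior or his proposal, please share your views on the mailing list.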

09:28 · Jan 29, 2024 · Mon

Chips and Cheese
Examining AMD’s RDNA 4 Changes in LLVM
#ChipAndCheese

Telegraph | source
(author: clamchowder)

As 2024 continues on, because time never stops, AMD has been working on their upcoming RDNA 4 architecture. Part of this involves supporting open source projects like LLVM. If done right, merging these changes early will ensure RDNA 4 will be well supported…

03:58 · Jan 29, 2024 · Mon

Matt Keeter
Reverse-engineering the Synacor challenge

source
(author: Matt Keeter (matt.j.keeter@gmail.com))

03:25 · Jan 22, 2024 · Mon

Daniel Lemire's blog
C23: a slightly better C

Telegraph | source

One of the established and most popular programming languages is the C programming language. It is relatively easy to learn, and highly practical. Maybe surprisingly, the C programming language keeps evolving, slowly and carefully. If you have GCC 13 or LLVM…

18:17 · Jan 21, 2024 · Sun

#JamesAslan #龙芯 (Loongson)
https://zhuanlan.zhihu.com/p/678983061

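Zhihu Column

Apple M2 Blizzard Microarchitecture Review (Part 2): Refined Elegance (阳春白雪)

Interlude: In Part 1 we mainly examined how Blizzard performs under benchmarks and other workloads; in Part 2 we go on to examine Blizzard's front-end and back-end design. JamesAslan: Apple M2 Blizzard Microarchitecture Review (Part 1): Refined Elegance. Front end: As modern programs keep growing in size, processors face ever…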

03:37 · Jan 21, 2024 · Sun

Chips and Cheese
Inside Qualcomm’s Adreno 530, a Small Mobile iGPU
#ChipAndCheese

Telegraph | source
(author: clamchowder)

GPU architectures vary drastically depending on their primary use cases. Mobile designs like Qualcomm’s Adreno face a daunting set of challenges, with smaller power and area budgets than even Intel’s iGPUs. Mobile SoCs have to share a small SoC die with a…

23:54 · Jan 18, 2024 · Thu

Daniel Lemire's blog
How much memory bandwidth do large Amazon instances offer?

Telegraph | source

In my previous post, I described how you can write a C++ program to estimate your read memory bandwidth. It is not very difficult: you allocate a large memory region and you read it as fast as you can. To see how much bandwidth you may have if you use multithreaded…
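
The "allocate a large region and read it as fast as you can" idea from the preview above can be sketched in a few lines of C. This is a rough single-threaded illustration, not the code from the linked post: the function name `estimate_read_bandwidth_gbps`, the checksum parameter, and the use of `timespec_get` (a wall clock, not a monotonic one) are all assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: reads `bytes` bytes from a freshly allocated buffer
   and returns the observed read rate in GB/s (1 GB = 1e9 bytes).
   Returns -1.0 if the buffer cannot be allocated or the timer saw no time pass. */
static double estimate_read_bandwidth_gbps(size_t bytes, uint64_t *checksum_out) {
    size_t n = bytes / sizeof(uint64_t);
    if (n == 0) return -1.0;
    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (buf == NULL) return -1.0;
    memset(buf, 1, n * sizeof(uint64_t)); /* touch every page before timing */

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += buf[i]; /* the timed read pass */
    timespec_get(&t1, TIME_UTC);

    *checksum_out = sum; /* expose the sum so the loop cannot be optimized away */
    free(buf);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    if (seconds <= 0.0) return -1.0;
    return (double)(n * sizeof(uint64_t)) / seconds / 1e9;
}
```

The buffer should be much larger than the last-level cache, otherwise the number reflects cache bandwidth rather than memory bandwidth; a single pass on one thread will usually understate what the machine can deliver.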

18:55 · Jan 15, 2024 · Mon

#今日看了什么 (What I read today)
https://mp.weixin.qq.com/s?__biz=MzkxMTIyODMwOQ==&mid=2247483873&idx=1&sn=cdf69ba59f3620b6007a8741a89d5155&chksm=c11e2d1bf669a40d5e2303b93cf876ee0e2b260457d7e593963820013a4a9a3767ddb24ce371&mpshare=1&scene=23&srcid=0115DoKHEINFH4gVJyq0oKTh&sharer_shareinfo=50501534f60f60752a078d186ea5da27&sharer_shareinfo_first=448a9c8794f02598e4f161bc66ea4889#rd

Weixin Official Accounts Platform

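Registration | Peking University High Performance Computing Comprehensive Ability Competition

Come to the Peking University High Performance Computing Comprehensive Ability Competition and wield supreme power!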

05:02 · Jan 14, 2024 · Sun

Daniel Lemire's blog
Estimating your memory bandwidth

Telegraph | source

One of the limitations of a computer is its memory bandwidth. For the scope of this article, I define “memory bandwidth” as the maximal number of bytes you can bring from memory to the CPU per unit of time. E.g., if your system has 5 GB/s of bandwidth, you…

05:59 · Jan 12, 2024 · Fri

Chips and Cheese
Previewing Meteor Lake at CES
#ChipAndCheese

Telegraph | source
(author: clamchowder)

Intel has been using a hybrid core strategy for years in a bid to leverage their bigger engineering budget to corner AMD. Specifically, P-Cores focus on maximizing per-thread performance. E-Cores avoid pursuing diminishing returns and take less power and…

04:07 · Jan 12, 2024 · Fri

Daniel Lemire's blog
Implementing the missing sign instruction in AVX-512

Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.

With SSSE3, Intel introduced sign instructions, with the corresponding intrinsic functions (e.g., _mm_sign_epi8). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers in AVX2.

What these instructions do is apply the sign of one parameter to the other parameter. It is most easily explained as pseudocode:

function sign(a, b): # a and b are integers
   if b == 0 : return 0
   if b < 0 : return -a
   if b > 0 : return a
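
The pseudocode above can be written as a small scalar C function for a single 8-bit value. This is only an illustrative sketch: the name `sign_i8` is made up here, and it assumes the usual two's-complement wrap-around, so that negating -128 yields -128 again rather than overflowing.

```c
#include <stdint.h>

/* Scalar model of the sign operation on one signed 8-bit value:
   returns 0 when b == 0, -a when b < 0, and a when b > 0. */
static int8_t sign_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    /* Negate through unsigned arithmetic so that a == -128 wraps to -128
       instead of overflowing (assumed two's-complement behaviour). */
    if (b < 0) return (int8_t)(uint8_t)(0u - (uint8_t)a);
    return a;
}
```

For instance, `sign_i8(5, -3)` gives -5, `sign_i8(5, 0)` gives 0, and `sign_i8(5, 7)` gives 5.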

The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can generate sixteen signed 8-bit integers at once.

You can view it as a generalization of the absolute value function: abs(a) = sign(a,a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., they are used in llama.cpp, the open source LLM project.

When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function _mm256_sign_epi8, we don’t have _mm512_sign_epi8. The same instructions are missing for 16-bit and 32-bit integers (e.g., no _mm512_sign_epi16 is found).

You may implement it for AVX-512 with several instructions. I found this approach:

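#include <x86intrin.h>

// Per 8-bit lane: 0 if b == 0, -a if b < 0, a if b > 0 (requires AVX-512BW).
__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);               // lanes where b < 0 (sign bit set)
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);      // lanes where b <= 0
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);  // a where b < 0, otherwise 0
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);    // 0 - a_blt0 where b <= 0, otherwise a
}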

It is disappointingly expensive. It might compile to four or five instructions:

vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1
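
To see why the mask-based sequence handles all three cases, it can help to emulate a single 8-bit lane in plain C and compare it against the three-case definition. The sketch below does that; the function names are made up for illustration, the variable names mirror the intrinsics code, and two's-complement wrap-around is assumed for the subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping 8-bit subtraction x - y, as one lane of vpsubb would compute it. */
static int8_t sub_wrap_i8(int8_t x, int8_t y) {
    return (int8_t)(uint8_t)((uint8_t)x - (uint8_t)y);
}

/* One lane of the mask-based sequence. */
static int8_t sign_lane_emulated(int8_t a, int8_t b) {
    const int8_t zero = 0;
    bool blt0 = ((uint8_t)b & 0x80u) != 0;        /* sign bit of b, i.e. b < 0 */
    bool ble0 = b <= zero;                        /* b <= 0 */
    int8_t a_blt0 = blt0 ? a : zero;              /* a where b < 0, otherwise 0 */
    return ble0 ? sub_wrap_i8(zero, a_blt0) : a;  /* 0 - a_blt0 where b <= 0, otherwise a */
}

/* Reference: the three-case definition written out directly. */
static int8_t sign_reference(int8_t a, int8_t b) {
    if (b == 0) return 0;
    return b < 0 ? sub_wrap_i8(0, a) : a;
}

/* Counts the (a, b) pairs, out of all 65536, on which the two disagree. */
static int count_mismatches(void) {
    int mismatches = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (sign_lane_emulated((int8_t)a, (int8_t)b) !=
                sign_reference((int8_t)a, (int8_t)b))
                mismatches++;
    return mismatches;
}
```

The key step is the case b = 0: the lane is selected by the "b <= 0" mask but not by the "b < 0" mask, so the subtraction computes 0 - 0 and the result is zero.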

In practice, you may not need to pay such a high price. The problem is difficult because we have three cases to handle (three signs: b = 0, b > 0, b < 0). If you do not care about the case ‘b = 0’, then you can do it in two instructions:

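#include <x86intrin.h>

// Per 8-bit lane: -a if b < 0, otherwise a (b == 0 is not zeroed).
__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();          // all-zero vector, used as the minuend
  __mmask64 blt0 = _mm512_movepi8_mask(b);        // lanes where b < 0 (sign bit set)
  return _mm512_mask_sub_epi8(a, blt0, zero, a);  // 0 - a where b < 0, otherwise a
}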

That is, we implemented…

function sign_cheated(a, b): # a and b are integers
   if b < 0 : return -a
   if b ≥ 0 : return a
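
A quick way to see what the shortcut gives up is to model both variants on a single 8-bit value and count where they differ. The sketch below follows the intrinsics version of the shortcut, whose mask is the sign bit of b (so it negates only when b < 0). The function names are hypothetical, and two's-complement wrap-around is assumed.

```c
#include <stdint.h>

/* Wrapping 8-bit negation, as a masked vpsubb from zero would compute it. */
static int8_t neg_wrap_i8(int8_t x) {
    return (int8_t)(uint8_t)(0u - (uint8_t)x);
}

/* Full three-case sign: 0 if b == 0, -a if b < 0, a if b > 0. */
static int8_t sign_full_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    return b < 0 ? neg_wrap_i8(a) : a;
}

/* Shortcut modelled on the two-instruction intrinsics version:
   negate a when the sign bit of b is set, otherwise pass a through. */
static int8_t sign_cheated_i8(int8_t a, int8_t b) {
    return b < 0 ? neg_wrap_i8(a) : a;
}

/* Counts the (a, b) pairs, out of all 65536, on which the two variants differ. */
static int count_differences(void) {
    int differences = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (sign_full_i8((int8_t)a, (int8_t)b) !=
                sign_cheated_i8((int8_t)a, (int8_t)b))
                differences++;
    return differences;
}
```

The two variants disagree only when b is zero and a is nonzero, which is 255 of the 65536 possible pairs; in those cases the shortcut returns a where the full version returns 0.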

source

10:36 · Jan 11, 2024 · Thu

The memory remains: Permanent memory with systemd and a Rust allocator https://darkcoding.net/software/rust-systemd-memory-remains/

Graham King

A Rust object that survives program restart thanks to Rust allocators, systemd's file descriptor store, and syscall memfd_create.

01:13 · Jan 10, 2024 · Wed

Arch Linux: Recent news updates
Making dbus-broker our default D-Bus daemon

We are making dbus-broker our default implementation of D-Bus, for improved performance, reliability and integration with systemd.

For the foreseeable future we will still support the use of dbus-daemon, the previous implementation. Pacman will ask you whether to install dbus-broker-units or dbus-daemon-units. We recommend picking the default.

For a more detailed rationale, please see our RFC 25.

source
(author: Jan Alexander Steffens)

12:57 · Jan 9, 2024 · Tue

totally_safe_transmute, Line-by-Line (2021) https://blog.yossarian.net/2021/03/16/totally_safe_transmute-line-by-line

04:11 · Jan 9, 2024 · Tue

Chips and Cheese
Maxwell: Nvidia’s Silver 28nm Hammer
#ChipAndCheese

Telegraph | source
(author: clamchowder)

Nvidia’s Kepler architecture gave the company a strong start in the 28nm era. Consumer Kepler parts provided highly competitive gaming performance and power efficiency. In the compute market, Kepler had no serious competition thanks to the strong CUDA software…

21:17 · Jan 5, 2024 · Fri

WPA3 Enterprise 192-bit mode at home https://smallstep.com/blog/home-network-eap-tls-wifi/

Smallstep