Easton Man's Channel

News that @EastonMan reads
+ ramblings
+ admiring the experts
+ the occasional cat
+ songs Easton listens to

13:47 · Aug 10, 2024 · Sat

Back in business

08:58 · Aug 8, 2024 · Thu

04:59 · Aug 5, 2024 · Mon

Chips and Cheese
Cortex A73’s Not-So-Infinite Reordering Capacity
#ChipAndCheese

Telegraph | source
(author: clamchowder)


Cortex A73 aimed to address the power and thermal issues that prevented Arm’s early 64-bit cores from reaching their full potential. It started a trend that saw Arm successfully capture the smartphone CPU market, and did so by emphasizing efficiency. Part…

ChipAndCheese

07:31 · Aug 4, 2024 · Sun

Daniel Lemire's blog
Converting ASCII strings to lower case at crazy speeds with AVX-512

Telegraph | source


AMD Zen 4 and Zen 5, as well as server-side recent Intel processors, support an advanced set of instructions called AVX-512. They are powerful SIMD (Single Instruction, Multiple Data) instructions. Importantly, they allow ‘masked’ operations. That is, you…

22:21 · Aug 2, 2024 · Fri

杰哥的{运维，编程，调板子}小笔记

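2-taken and 2-ahead in Branch Prediction

Background

With the launch of Zen 5, more details of its architecture have been made public, and the front end shows a striking change: it introduces a 2-taken, 2-ahead branch prediction design. What does that mean? How is it implemented in the microarchitecture? What performance gains can it bring?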

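Background Knowledge

First, let us review what a processor's front end does: based on the PC, it reads instructions from the ICache, decodes them, and sends them to the back end for execution. However, many of the executed instructions are branches, which change the PC, so subsequent instructions must be fetched from the new PC. At fetch time, the front end does not yet know how a branch will resolve. If every branch required flushing the pipeline and refetching, there would be many pipeline bubbles, which is why branch prediction exists.

Branch prediction and instruction fetch happen at the same time: while instructions are being fetched, the predictor is also predicting whether those instructions will jump and, if so, what the target address is, in order to decide where to fetch from in the next cycle. To do this, the front end first needs to know whether there are any branch or jump instructions. That information is either stored in the BTB, or only becomes known after fetch completes and decode identifies which instructions are branches or jumps. If there is one and it is a conditional branch, a direction predictor (CBP) is needed to decide whether the branch is taken, plus a branch target buffer (BTB) to supply the target if it is. Beyond conditional branches, a return instruction jumps to an address that corresponds to an earlier call, so the call's return addresses must be recorded in a return address stack (RAS). For other indirect jumps, such as calls through function pointers, a single jump may have multiple targets, so a predictor for indirect jump targets (IBP) is also needed. These components (CBP, BTB, RAS, and IBP) make up the branch predictor of a modern processor.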
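To make two of these components concrete, here is a minimal Python sketch of a direction predictor built from 2-bit saturating counters and of a return address stack. It is an illustration under stated assumptions rather than how any real core works: the class names, the table size, the PC-modulo indexing, the 4-byte instruction size, and the RAS depth are all made up for this example, and real direction predictors also use branch history.

```python
class TwoBitCounterPredictor:
    """Toy direction predictor (CBP): one 2-bit saturating counter per entry."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, table indexed by low PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class ReturnAddressStack:
    """Toy RAS: push the return address on a call, pop it on a return."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # Returns None when the stack is empty (no prediction available).
        return self.stack.pop() if self.stack else None
```

A single not-taken outcome does not flip a strongly taken counter, which is the point of using two bits instead of one.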

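On top of this, decoupled front ends are currently popular. In contrast, in a coupled (non-decoupled) front end the branch predictor and the instruction cache work in lockstep: the branch predictor supplies the next fetch address, and the fetched instructions are immediately fed back to the branch predictor. A decoupled front end turns the branch predictor into a producer that generates fetch addresses, while the instruction cache is the consumer that consumes those addresses, reads instructions from the cache, and passes them on for decoding. The producer and consumer are separated by a queue, the Fetch Target Queue. This way the branch predictor can work independently of the instruction cache and run ahead: even when the instruction cache misses, it can keep predicting branches many instructions into the future. Going further, information about those run-ahead branches can be used to prefetch instructions from the L2 cache into the L1 instruction cache, so that when the instruction cache later needs them, they are very likely already present.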
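The producer/consumer relationship can be sketched as a tiny cycle-by-cycle simulation in Python. This is a toy model with assumed parameters, not a description of any real front end: the function name, the one-address-per-cycle rates, the queue depth, and the `miss_penalty` mapping from address to stall cycles are all hypothetical.

```python
from collections import deque


def simulate_decoupled_frontend(targets, miss_penalty, ftq_depth=4):
    """Toy model: predictor (producer) and fetch unit (consumer) joined by an FTQ.

    targets: fetch addresses in predicted order.
    miss_penalty: dict of address -> extra stall cycles when fetching it.
    Returns (addresses in fetch order, peak FTQ occupancy, total cycles).
    """
    pending = deque(targets)
    ftq = deque()
    fetched = []
    stall = 0
    max_occupancy = 0
    cycles = 0
    while pending or ftq or stall:
        cycles += 1
        # Producer: the predictor enqueues one fetch address if there is room,
        # even while the fetch unit is stalled on a cache miss.
        if pending and len(ftq) < ftq_depth:
            ftq.append(pending.popleft())
        max_occupancy = max(max_occupancy, len(ftq))
        # Consumer: the fetch unit either waits out a miss or takes one address.
        if stall:
            stall -= 1
        elif ftq:
            addr = ftq.popleft()
            fetched.append(addr)
            stall = miss_penalty.get(addr, 0)
    return fetched, max_occupancy, cycles
```

With a miss on the first address, the queue fills up while the fetch unit waits. That backlog is the run-ahead a real design could use to drive prefetches from L2.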

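Of course, running ahead in a decoupled front end has a cost: the branch predictor does not know what the instructions fetched in the future will actually be and can only rely on history recorded in the BTB, so the BTB is usually made fairly large. A coupled front end, on the other hand, can predecode instructions as they are loaded from L2 into the L1 instruction cache, find the branches among them, and then use the L1 instruction cache as a larger BTB. For example, Apple M1 Firestorm can use its huge 192KB L1 instruction cache as a BTB, giving a very large effective BTB capacity. Which approach is better is not yet clear.

So what stages does a branch go through, from prediction to fetch to execution? First comes branch prediction: the predictor picks out the taken cases, because they affect the next fetch address. If there is no branch, or there is a branch that is not taken, things are simple and the next fetch address is just the next sequential address. After fetch and decode, the result is compared with the earlier prediction; if something was predicted to be a branch but is not actually one, the predictor was wrong and this is corrected early. At execution, the branch's operands are used to determine whether it actually jumps, and this is compared with the prediction. If the prediction was right, all is well; if it was wrong, the wrongly speculated instructions must be flushed. The branch predictor also has to be notified so that it can update its prediction counters.

Now back to the main topic of this article: the larger changes in branch prediction over the last few years.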

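2 branch

As mentioned above, a branch is involved in the prediction, fetch, and execution stages. The execution stage is comparatively simple and therefore easy to scale. For example, Cortex-A77 introduced a second branch execution unit, so it can execute two branch instructions per cycle, and many high-performance processors now have two branch execution units. But most processors can predict only one branch per cycle, so that structures such as the BTB are accessed only once per cycle. Does that make two branch execution units pointless? After all, if the first branch is taken, the second branch's address has to be derived from the first branch's target, so the two predictions become partly serialized, which is hard to do. But if the first branch is not taken, predicting whether the second one is taken or not is relatively easy to support. Thus the predictor can predict at most one taken branch per cycle, along with not-taken branches in the same cycle. In addition, branches that have never been taken are generally not recorded in the BTB at all, to save BTB storage. Given these cases, having two branch execution units brings some benefit.

Then why not go to three? In fact that has happened: Zen 5 moved to three branch execution units. But three only pay off if two taken branches can be predicted per cycle; otherwise the performance gain is small. How is that done? We discuss it below.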

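2-taken

As mentioned, to further improve branch prediction and execution throughput, more taken branches must be predicted per cycle: previously one, now two. ARM added 2-taken branch prediction in Cortex-A78, meaning it can predict up to two taken branches per cycle. How is that done? A fully general design would, as described above, predict the first branch and then use that result to predict the second, all within one cycle, which is very challenging. Let us analyze it:

Suppose there are four basic blocks A, B, C, and D, executed in that order. That is, the last instruction of A is a branch that jumps to B, and likewise B jumps to C and C jumps to D. The classic branch prediction scheme uses A to predict B, B to predict C, and C to predict D, predicting one taken branch per cycle. To implement 2-taken prediction, given A we must predict both B and C. But that means first using A to predict B and then using B to predict C, which is serial and makes timing hard to meet. Alternatively, two predictors could run in parallel, one using A to predict B and one using A to predict C, but then the information recorded per branch doubles.

The paper Multiple-Block Ahead Branch Predictors shows another, more elegant approach: given A and B, use A to predict C and use B to predict D. What is predicted is then the target one step further on rather than the immediate target. In this design, structures such as the BTB must become dual-ported so that two branches, A and B, can be predicted at once. After C and D are predicted, the same method predicts E and F, and so on. The design in the paper is of course somewhat more complex than described here; see the paper for details.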
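The "use A to predict C, use B to predict D" idea can be sketched in a few lines of Python. This is a deliberately simplified model of the scheme as summarized above, not the design from the paper: the function names are hypothetical, the table simply remembers the last block seen two steps after each block, and there is no history, tagging, or misprediction recovery.

```python
def train_two_ahead(trace):
    """Record, for each block, the block that followed it two steps later."""
    table = {}
    for i in range(len(trace) - 2):
        table[trace[i]] = trace[i + 2]
    return table


def predict_stream(table, first, second, cycles):
    """Starting from two known blocks, predict two more blocks per cycle."""
    stream = [first, second]
    for _ in range(cycles):
        a, b = stream[-2], stream[-1]
        if a not in table or b not in table:
            break  # no prediction available for one of the two lookups
        # The two lookups do not depend on each other, so hardware could do
        # them in the same cycle with a dual-ported table.
        stream.extend([table[a], table[b]])
    return stream
```

For example, after training on the block sequence A, B, C, D, E, F, calling `predict_stream(table, "A", "B", 2)` looks up A and B together to get C and D, then C and D together to get E and F. Neither lookup in a cycle waits for the other, which is what removes the serial dependency.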

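We do not know exactly how ARM implemented 2-taken, but we can guess that it imposes some restrictions: for example, although both branches are taken, there may be limits on their offsets or addresses. Intel's Golden Cove and AMD's Zen 4 also implement 2-taken.

Even so, 2-taken only increases branch prediction bandwidth, so that more branches can be predicted per cycle. As noted earlier, the branch predictor is the producer and the instruction cache is the consumer; when the producer gets faster, the consumer must keep up. But the instruction cache is a large SRAM whose power and timing are both troublesome, so it is hard to change. Simply widening the fetch, for example from 8 bytes to 16 bytes, works well when branch density is low, but not when there are many branches. To improve performance in that case, the cache needs to be dual-ported so that it can fetch from two different addresses every cycle. This is what Zen 5 does.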

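2-ahead

Besides 2-taken, Zen 5 also implements 2-ahead: each cycle it can fetch instructions from two addresses, decode them separately, and then stitch the results together. In addition, Zen 5's dual fetch can serve hyper-threading (SMT), with each thread using one fetch pipeline. When the sibling thread is idle, both fetch pipelines can serve the same thread and deliver higher performance.

The implementation details of Zen 5 are not yet clear; we look forward to more microarchitectural analysis in the future.

References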

● Zen 5’s 2-Ahead Branch Predictor Unit: How a 30 Year Old Idea Allows for New Tricks
● Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence - Anandtech
● AMD Zen 5 Technical Deep Dive
● AMD Zen 5 Architecture Reveal: A Ryzen 9000 And Ryzen AI 300 Deep Dive
● AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more
● Optimizations Enabled by a Decoupled Front-End Architecture
● The Cortex-A77 µarch: Added ALUs & Better Load/Stores
● Multiple-Block Ahead Branch Predictors
● Popping the Hood on Golden Cove
● AMD Zen 4 Ryzen 9 7950X and Ryzen 5 7600X Review: Retaking The High-End

source

20:38 · Aug 2, 2024 · Fri

Matt Keeter
Panic! At the Async Runtime Shutdown

source
(author: Matt Keeter (matt.j.keeter@gmail.com))

01:56 · Aug 1, 2024 · Thu

Chips and Cheese
Grace Hopper, Nvidia’s Halfway APU
#ChipAndCheese

Telegraph | source
(author: clamchowder)


Nvidia and AMD are the biggest players in the high performance GPU space. But while Nvidia has a huge GPU market share advantage over AMD, the latter’s CPU prowess makes it a strong competitor. AMD can sell both a CPU and GPU as one unit, and that capability…

ChipAndCheese

15:34 · Jul 30, 2024 · Tue

杰哥的{运维，编程，调板子}小笔记

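Running Debian Linux on the Surface Laptop 7

Background

I recently borrowed a Surface Laptop 7 to tinker with. It uses Qualcomm's Snapdragon X Elite processor and runs Windows on Arm. As a Linux user, I was not going to settle for WSL and wanted to install Linux on bare metal. Because the machine is so new, the installation hit many bumps along the way.

Upstream Progress

Upstream support for the X Elite processor is gradually maturing, but a very recent kernel is still required: device tree support for two X Elite laptops was merged into the kernel only recently. I am using kernel version v6.11-rc1-43-g94ede2a3e913. At the moment the display works, Wi-Fi works, and the USB Type-C ports work (a keyboard, mouse, and wired network adapter can all be attached over USB), while the built-in keyboard, touchpad, and touchscreen do not work. Hopefully hardware support will improve later.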

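The Process

Last year Qualcomm and Linaro released an experimental Debian Installer Image (https://git.codelinaro.org/linaro/qcomlt/demos/debian-12-installer-image), which targets Qualcomm's own CRD device rather than the Surface Laptop 7. Unsurprisingly, writing this image to a USB drive and booting it does not just work.

Note that the Surface Laptop 7 ships with Windows installed and Secure Boot enabled, and a self-compiled Linux kernel will naturally not pass Secure Boot, so Secure Boot must be disabled in the firmware. Because Windows BitLocker is on by default, first make sure you can obtain the BitLocker recovery key, or you may not be able to get back into Windows afterwards. Before setting up dual boot, remember to prepare the partition table from within Windows; if there is not enough space, the NTFS volume can be shrunk online.

To enter the firmware, hold the volume-up key while powering on. After power-on you will see the Surface UEFI interface, where you can change the boot order and disable Secure Boot. For convenience during installation, I suggest moving USB Storage to the top.

Next, boot the Debian Installer Image from the USB drive. After booting I landed in the grub shell, apparently because grub could not find its configuration file, which can be found under (hd1,msdos1)/boot/grub. But the device tree and kernel in this image are both fairly old. Booting it directly gets into the Debian Installer, but neither the built-in keyboard nor an external USB keyboard works, so the installation cannot proceed.

At this point I searched online for existing attempts to run Linux on X Elite and found that someone had installed it on an ASUS X Elite laptop (source). I tried booting with the device tree for that ASUS model, but it still did not work. After asking around (thanks to @imbushuo), I learned that the Surface's built-in keyboard and other peripherals are accessed through SAM and need extra configuration; it is currently unclear whether they can be enabled through the device tree.

But I soon found that someone had already got it running on a Surface Laptop 7 (source). I emailed the author, who said he used an external keyboard and that the built-in keyboard did not work for him either. Zooming in on the video the author recorded, I saw that he was using a Linux kernel from the latest master branch together with the CRD device tree. That gave me a clear plan: compile a kernel myself and use x1e80100-crd.dtb as the device tree.

So I hacked up the Debian Installer Image: I replaced the Linux kernel with my own build of the latest version, unpacked the initrd and replaced the kernel modules inside with those of the new kernel, copied over the new x1e80100-crd.dtb, and then used grub to boot the new kernel, new initrd, and new device tree. The external USB keyboard worked! Only the Type-C ports work, but that is enough to finish the rest of the job.

While installing Debian I hit one more small hiccup: the glibc version was not new enough, presumably because Linaro's image is too old. I copied libc.so.6 and ld-linux-aarch64.so.1 from a recent Debian arm64 system over the old versions in the initrd, and that fixed it.

After installation, the kernel in the installed system is Debian's latest kernel, which is still not new enough, so it was back to the old routine: manually arch-chroot into the new sysroot and install the new kernel. As the Linaro repository points out, you can also replace the deb in the Debian Installer Image directly, but the deb I built was too large (it is a defconfig build, after all) to fit in the file system, so I ended up installing it by hand.

Finally, add a devicetree load command to the grub configuration and borrow the linux cmdline from the Debian Installer Image's grub configuration. The final grub configuration looks like this: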

devicetree /boot/x1e80100-crd.dtb
echo 'Loading Linux 6.11.0-rc1-00043-g94ede2a3e913 ...'
linux /boot/vmlinuz-6.11.0-rc1-00043-g94ede2a3e913 root=UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee ro efi=novamap pd_ignore_unused clk_ignore_unused fw_devlink=off cma=128M quiet
echo 'Loading initial ramdisk ...'
initrd /boot/initrd.img-6.11.0-rc1-00043-g94ede2a3e913

With all that done, the Debian system boots up properly!

This post has also been shared on Reddit: https://www.reddit.com/r/SurfaceLinux/comments/1efmyb3/managed_to_install_baremetal_linux_on_snapdragon/

source

23:49 · Jul 28, 2024 · Sun

Daniel Lemire's blog
Evolution of iPhone storage capacity

People who should know better often underestimate how fast our storage capacity has grown. We have been able to get 1 TB of storage on iPhones for the last three generations.

Further reading: In 2011, I predicted that the iPhone would have 1TB of storage in 2020.

source

05:38 · Jul 28, 2024 · Sun

Daniel Lemire's blog

Storage costs are plummeting like a skydiver in freefall: storage gets between 10 and 100 times cheaper with each passing decade. Meanwhile, the programmer population is growing at a leisurely pace, like a tortoise in a marathon, increasing by about 50% per decade. And the Linux kernel? It is maybe doubling in size every ten years. The net result: we are using storage for data (videos, images, model weights) while code is taking a backseat, fading into the background.

We just cannot code fast enough to fill our increasingly large disks.
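To see how quickly these rates diverge, the per-decade factors quoted above can be compounded. The Python sketch below is only back-of-the-envelope arithmetic: the three-decade horizon is an arbitrary choice for illustration, and it assumes that storage per unit cost grows by the same factor by which storage gets cheaper.

```python
def growth(factor_per_decade: float, decades: int) -> float:
    """Compound a per-decade growth factor over a number of decades."""
    return factor_per_decade ** decades


DECADES = 3  # arbitrary horizon, chosen only for illustration

summary = {
    "storage per unit cost (low estimate)": growth(10, DECADES),
    "storage per unit cost (high estimate)": growth(100, DECADES),
    "programmer population": growth(1.5, DECADES),
    "Linux kernel size": growth(2, DECADES),
}

for name, factor in summary.items():
    print(f"{name}: x{factor:,.3f}")
```

Under these assumptions, storage per unit cost grows by a factor of a thousand to a million over three decades, while the programmer population and the kernel grow by single-digit factors.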

source

01:42 · Jul 28, 2024 · Sun

Daniel Lemire's blog
How big are your docker images?

Docker is a standard to deploy software on the cloud. Developers start with an existing image and add their own code before deploying their systems. How big are typical uncompressed images?

Method: docker inspect -f "{{ .Size }}" docker.io/library/myimage

source

20:59 · Jul 27, 2024 · Sat

#Music
https://youtu.be/pVSf5QlSmA8

YouTube

All Things Bright and Beautiful (John Rutter) - National Taiwan University Chorus

Performed on 2011/07/12, Taipei, Taiwan
Conductor：吳采頻
Pianist：林雅婷
Composer：John Rutter

All Things Bright and Beautiful〈萬物光彩絢爛〉

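The composer John Rutter (1945–) writes mostly choral works, chiefly sacred music. BBC Music Magazine has praised him as the most successful and best-known choral composer in recent British history.

〈萬物光彩絢爛〉 (All Things Bright and Beautiful) was originally a Christian hymn, because at the time the Westminster Choir College (Westminster…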

Music

08:23 · Jul 27, 2024 · Sat

Daniel Lemire's blog
How much of your binary executable is just ASCII text?

We sometimes use binary executables that span many megabytes. I wondered: how much text is contained in these binary files? To find out, I wrote a Python script which adds up the size of all sequences of more than 16 ASCII characters in the file.

My heuristic is simple but is not quite perfect: some long sequences might not be actual text and some short ASCII strings might be missed. Nevertheless, it should be good enough to get some insight.
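The script itself is not reproduced here, so the following is only a minimal Python sketch of the heuristic as described. The function name is hypothetical, and treating "ASCII characters" as the printable range 0x20 to 0x7E is an assumption; the original script may define it differently.

```python
import re


def ascii_text_bytes(data: bytes, min_run: int = 17) -> int:
    """Sum the lengths of all runs of at least `min_run` printable ASCII bytes.

    min_run=17 corresponds to "more than 16 ASCII characters".
    Assumption: "ASCII character" means a printable byte in 0x20..0x7E.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return sum(len(match.group()) for match in pattern.finditer(data))
```

To apply it to a file, read the file in binary mode (for example with `pathlib.Path(path).read_bytes()`) and compare the result with the total length of the data to get the fraction that looks like text.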

I downloaded macOS binaries for some popular JavaScript runtimes. I find that tens of megabytes are used for what is likely ASCII strings.

source

02:25 · Jul 27, 2024 · Sat

Chips and Cheese
Zen 5’s 2-Ahead Branch Predictor Unit: How a 30 Year Old Idea Allows for New Tricks
#ChipAndCheese

Telegraph | source


When I recently interviewed Mike Clark, he told me, “…you’ll see the actual foundational lift play out in the future on Zen 6, even though it was really Zen 5 that set the table for that.” And at that same Zen 5 architecture event, AMD’s Chief Technology…

ChipAndCheese

00:25 · Jul 27, 2024 · Sat

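I am planning to start a newsletter soon. It will mainly cover recent news and a reading list on computer architecture and performance optimization. The update frequency is not fixed, but it will be no more than once a week, and I expect roughly once every two weeks to once a month. The information density may be somewhat higher than this tg channel. The format will be modeled on Denis Bakhvalov's Perf letter.
If you are interested, please subscribe.

https://newsletter.eastonman.com/subscription/form

Because the mail server I am using is newly set up and has no reputation, the subscription confirmation email may also land in your spam folder. To make sure you receive the emails, you can add noreply#eastonman.com to your allowlist. If future emails end up in spam, please move them to your regular inbox: most mail providers use learning-based spam filters, and moving the messages helps improve my mail server's reputation.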

23:42 · Jul 26, 2024 · Fri

Daniel Lemire's blog
Safer code in C++ with lifetime bounds

For better performance in software, we avoid unnecessary copies. To do so, we introduce references (or pointers). An example of this idea in C++ is the std::string_view class. As the name suggests, a std::string_view instance is merely a ‘view’: it points at some string, but it does not own or otherwise manage the underlying memory.

The downside is that we must track ownership: when the owner of the memory is gone, we should not be left holding the std::string_view instance. With modern tools, it is trivial to detect such bugs (e.g., using a sanitizer). However, it would be nicer if the compiler could tell us right away.

A few C++ compilers (Visual Studio and LLVM) support lifetime-bound annotation to help us. Let us consider an example. Suppose you would like to parse a URL (e.g., ‘https://www.google.com/path’), but you are only interested in the host (e.g. ‘www.google.com’). You might write code like so using the ada-url/ada parsing library:

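#include <string_view>
#include "ada.h"  // ada-url/ada single-header library

std::string_view get_host(std::string_view url_string) {
  auto url = ada::parse(url_string).value();
  // Deliberately unsafe: url is a local object, so the returned view dangles.
  return url.get_host();
}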

This code is not generally safe. The parser will store the result of the parse inside a temporary object but you are returning an std::string_view which points at it. You have a dangling reference.

To get a recent version of LLVM/clang (18+) to warn us, we just need to annotate the function get_host like so:

#ifndef __has_cpp_attribute
    #define ada_lifetime_bound
#elif __has_cpp_attribute(msvc::lifetimebound)
    #define ada_lifetime_bound [[msvc::lifetimebound]]
#elif __has_cpp_attribute(clang::lifetimebound)
    #define ada_lifetime_bound [[clang::lifetimebound]]
#elif __has_cpp_attribute(lifetimebound)
    #define ada_lifetime_bound [[lifetimebound]]
#else
    #define ada_lifetime_bound
#endif

...

std::string_view get_host() const noexcept ada_lifetime_bound;

And then we get a warning at compile time:

fun.cpp:8:10: warning: address of stack memory associated with local variable 'url_aggregator' returned [-Wreturn-stack-address]
    8 |   return url_aggregator.get_host();

It is hardly perfect at this point in time. It does not always warn you, but progress is being made. This feature and others will help us catch errors sooner.

Credit: Thanks to Denis Yaroshevskiy for making me aware of this new compiler feature.

source

12:57 · Jul 24, 2024 · Wed

#今日看了什么 (what I read today)
https://zh.m.wikipedia.org/wiki/%E5%8C%97%E4%BA%AC%E5%AE%A3%E8%A8%80_(2024%E5%B9%B4)

06:40 · Jul 23, 2024 · Tue

Chips and Cheese
Arm’s Neoverse V2, in AWS’s Graviton 4
#ChipAndCheese

Telegraph | source
(author: clamchowder)


Amazon Web Services (AWS) is the largest cloud provider, and an early Arm server adopter. AWS started investing into the Arm server ecosystem in 2018 with Graviton 1, which used 16 Cortex A72 cores. Three generations later, AWS’s Graviton 4 packs 96 Neoverse…

ChipAndCheese
