A quick retrospective on this AWS incident from my perspective as an SRE at an AIGC startup. I hope it helps.
Ever since I joined and discovered that our main cluster lives in us-east-1 (USE1), I had been preparing for this.
The main things I did were:
1. Set up multi-region backups for our core databases across USE1, Tokyo, and Singapore (SG). In an extreme scenario we would lose some data, but the service could keep running.
2. Rebuilt our SG test cluster, originally a hand-rolled K3s setup on EC2, as a standard AWS EKS cluster. In a disaster we can quickly warm up a cluster, reuse existing AWS components, and keep manifest changes to a minimum.
3. Drafted a simple SOP covering user announcements, DNS switchover, code freeze, and related matters.
Back to today: about 10 minutes after the AWS incident started, I noticed that new Pods in our cluster could not start.
After confirming with AWS Support that it was a USE1 issue, I realized the ECR incident would inevitably be tied to other failures, so I decisively started handling it as the Tier 1 incident I had planned for (for an SRE, it is better to over-call an incident than to miss one).
T+0 min: I sent an all-hands announcement and declared emergency mode, and set up an open company-wide call that anyone could join at any time.
T+2 min: I confirmed the incident was spreading as I expected and issued two directives: 1. freeze all code merges and commits across the board (mainly to keep newly created resources from rotating Pods and disrupting traffic); 2. have the operations team prepare a public announcement.
T+3 min: Following the SOP, I started restoring the databases in the SG region, along with their dependent services such as OpenSearch and Redis.
T+5 min: We began formally assessing the specific impact on upstream and downstream dependencies and confirmed that one newly launched core service was affected.
T+10 min: We published the downtime announcement and the impact notices for the other services.
T+10 min: I asked two colleagues to help set up a new ECR and clean up existing resources in the test environment, and briefed the CTO that in an extreme scenario we might decide to preserve the user experience at the cost of losing some data.
T+15 min: We confirmed that already-created resources and inbound traffic would not be significantly affected. The failover plan was put on hold, but we kept preparing the resources.
T+30 min: Our first database was restored.
T+40 min: Our second database was restored.
T+1 h: All related core infrastructure (RDS / ES / Redis) was standing by, configured with primary/replica setups and other options matching the production architecture. We also started bringing up services on the new cluster.
Fortunately, the AWS outage did not take down all of our services, so we never had to face the messy data reconciliation that follows a traffic cutover.
Around T+2 h to T+3 h, I formally announced to everyone that the emergency was over. To be safe, the feature freeze stayed in place for the night.
Looking back at the incident, there is more I could have done:
1. Share the extreme-case SOP I had prepared for myself with the whole team, so that someone can take over even if I am offline.
2. Run disaster drills ahead of time.
3. Issue directives more decisively.
That is about it. A small share that I hope helps.
US Government Uptime Monitor https://usa-status.com/
Daniel Lemire's blog
Speeding up C++ functions with a thread_local cache
In large code bases, we are often stuck with unpleasant designs that harm our performance. We might be looking for a non-intrusive way to improve performance; for example, you may not want to change the function signatures.
Let us consider a concrete example. Maybe someone designed the programming interface so that you have to access the values from a map using an index. They may have code like so:
```cpp
auto at_index(map_like auto& index_map, size_t idx) {
  size_t count = 0;
  for (const auto &[key, value] : index_map) {
    if (count == idx)
      return value;
    count++;
  }
  throw std::out_of_range("Index out of range");
}
```
This code goes through the keys of the map `idx` times. Typically, it implies some kind of linked-list traversal. If you are stuck with this interface, going through the values might imply repeated calls to the `at_index` function:
```cpp
for (size_t i = 0; i < input_size; ++i) {
  at_index(index_map, i);
}
```
If you took any kind of computer science class, you will immediately see the problem: my code has quadratic complexity. If you double the map size, you may quadruple the running time. It is likely fine if you have 2 or 4 elements in the map, but definitely not fine if you have 400 elements.
The proper solution is to avoid such a design. If you have direct access to the map, you can just iterate through it directly:
```cpp
for (auto& [key, value] : index_map) {
  sum += value;
}
```
But what if you are stuck? Then you can use a static or `thread_local` cache. The key insight is to keep your location in the map in a cache, and start from there on the next query. If the user is typically querying in sequence, then your cache should speed up the function tremendously.
```cpp
auto at_index_thread_local_cache(map_like auto& index_map, size_t idx) {
  using iterator = decltype(index_map.begin());
  // Per-thread cache: where the previous successful lookup ended up.
  struct Cache {
    iterator last_iterator;
    size_t last_index = size_t(-1);
  };
  thread_local Cache cache;
  // Fast path: the cache is primed and the caller asked for the next index.
  if (cache.last_index != size_t(-1) && idx == cache.last_index + 1 &&
      cache.last_iterator != index_map.end()) {
    cache.last_iterator++;
    cache.last_index = idx;
    if (cache.last_iterator != index_map.end()) {
      return cache.last_iterator->second;
    } else {
      throw std::out_of_range("Index out of range");
    }
  } else {
    // Slow path (first call or non-sequential access): linear scan from the
    // beginning, rebuilding the cache along the way.
    cache.last_iterator = index_map.begin();
    cache.last_index = size_t(-1);
    size_t count = 0;
    for (auto it = index_map.begin(); it != index_map.end(); ++it) {
      if (count == idx) {
        cache.last_iterator = it;
        cache.last_index = idx;
        return it->second;
      }
      count++;
    }
    throw std::out_of_range("Index out of range");
  }
}
```
In C++, a `thread_local` variable has exactly one instance per thread, shared by all function calls made within that thread. If you wish to have just one instance of the variable for the entire program, you can use `static` instead, but `thread_local` is the best choice in our case. You might be worried about the performance implications of a `thread_local` variable, but it is generally quite cheap: we only add a few instructions when accessing or modifying it.
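To make the per-thread semantics concrete, here is a minimal standalone sketch (mine, not from the post): the `thread_local` counter in `bump()` starts from zero independently in each thread, whereas a `static` counter would be a single instance shared by all threads.
```cpp
#include <iostream>
#include <thread>

int bump() {
  thread_local int per_thread = 0; // one independent instance per thread
  return ++per_thread;
}

int main() {
  std::thread t1([] { std::cout << "t1: " << bump() << ' ' << bump() << '\n'; }); // t1: 1 2
  t1.join();
  std::thread t2([] { std::cout << "t2: " << bump() << '\n'; }); // t2: 1, not 3
  t2.join();
}
```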
Our cache variable remembers the last accessed iterator and index per thread. If the next index is requested, we just increment the iterator and return. If the access is non-sequential, or it is the first call, the function falls back to a linear scan from the beginning, rebuilding the cache along the way.
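Call sites do not change. As a minimal usage sketch (mine, assuming the author's `map_like` concept accepts a plain `std::map`, as his benchmark suggests), a sequential loop pays for one linear scan on the first call and then takes the fast path, while an out-of-order query triggers a full rescan:
```cpp
#include <cstddef>
#include <map>

long long demo() {
  std::map<size_t, int> index_map = {{10, 1}, {20, 2}, {30, 3}};
  long long sum = 0;
  for (size_t i = 0; i < index_map.size(); ++i) {
    // Linear scan on the first call, then one iterator increment per call.
    sum += at_index_thread_local_cache(index_map, i);
  }
  // Out-of-sequence query: falls back to a full rescan, then re-primes the cache.
  int revisit = at_index_thread_local_cache(index_map, 1);
  return sum + revisit; // 6 + 2
}
```
One caveat worth keeping in mind: the cached iterator is not tied to a particular map object, so the scheme assumes a given thread keeps querying the same map between calls.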
The code is more complicated, and if you are not accessing the keys in sequence, it might be slower. However, the performance gains can be enormous. [I wrote a benchmark to test it out with maps containing 400 elements](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2025/10/19/).
| Method   | ns/key | instructions/key |
|----------|--------|------------------|
| original | 300    | 2000             |
| cache    | 2      | 17               |
In my case, the cache multiplied the performance by 150. Not bad.
source
Chips and Cheese
AMD’s Chiplet APU: An Overview of Strix Halo
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Chips and Cheese
Panther Lake’s Reveal at ITT 2025
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Chips and Cheese
Interviewing Intel's Chief Architect of x86 Cores at Intel Tech Tour 2025
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Facebook has released OpenZL, a compression framework that learns the structure of a file format to optimize compression ratio, compression speed, and decompression speed at the same time.
To use it, programmers write a description of the file's structure and generate/train a compressor specific to that format, so the file's internal structure can be exploited to produce data streams that compress more easily. All of these compressed streams share a single decompressor, which does not need to change when the compressor changes. When the input has no particular structure, the algorithm falls back to zstd.
https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
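As a rough illustration of the underlying idea (a hand-written sketch, not OpenZL's actual API): if you know the input is an array of fixed-size records, splitting it into one byte stream per field gives a generic backend such as zstd far more homogeneous data to work with, while the inverse transform stays the same regardless of how the split is tuned.
```cpp
#include <cstdint>
#include <vector>

// Hypothetical record layout; the point is that the programmer describes the
// structure and a format-aware transform is derived from that description.
struct Record {
  uint32_t id;
  uint16_t flags;
  int16_t  delta;
};

// Split an array of records into one byte stream per field ("struct of arrays").
// Each stream would then be handed to a generic compressor; the inverse
// transform is the same for every format, so the decompressor never changes.
std::vector<std::vector<uint8_t>> split_streams(const std::vector<Record>& records) {
  std::vector<std::vector<uint8_t>> streams(3);
  for (const Record& r : records) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&r.id);
    streams[0].insert(streams[0].end(), p, p + sizeof r.id);
    p = reinterpret_cast<const uint8_t*>(&r.flags);
    streams[1].insert(streams[1].end(), p, p + sizeof r.flags);
    p = reinterpret_cast<const uint8_t*>(&r.delta);
    streams[2].insert(streams[2].end(), p, p + sizeof r.delta);
  }
  return streams;
}
```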
Chips and Cheese
AMD’s EPYC 9355P: Inside a 32 Core Zen 5 Server Chip
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
Chips and Cheese
A Look into Intel Xeon 6’s Memory Subsystem
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
AMD's Inaugural Tech Day ft. ROCm 7, Modular, and AMD Lab Tour
#ChipAndCheese
Hello you fine Internet folks,
I was invited to AMD's Austin headquarters for their inaugural Tech Day, where AMD announced ROCm 7 and Modular showed off their results on the MI355X, all topped off by a tour of AMD's labs.
Hope y'all enjoy!
[Embedded video: www.youtube-nocookie.com]
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
source
(author: George Cozma)