A quick retrospective on this AWS incident from my perspective as an SRE at an AIGC startup. I hope it helps.
Ever since I joined and discovered that our main cluster lives in us-east-1 (USE1), I had been preparing for this.
The main things I did were:
1. Set up multi-region backups for our core databases across USE1, Tokyo, and Singapore (SG). In an extreme scenario we would lose some data, but the service could keep running.
2. Rebuilt our SG test cluster, originally a hand-rolled K3s setup on EC2, as a standard AWS EKS cluster. In a disaster we can quickly warm up a cluster, reuse existing AWS components, and keep manifest changes to a minimum.
3. Drafted a simple SOP covering user announcements, DNS switchover, code freeze, and related matters.
Back to today: about 10 minutes after the AWS incident started, I noticed that new Pods in our cluster could not start.
After confirming with AWS Support that it was a USE1 issue, I realized the ECR incident would inevitably be tied to other failures, so I decisively started handling it as the Tier 1 incident I had planned for (for an SRE, it is better to over-call an incident than to miss one).
T+0 min: I sent an all-hands announcement and declared emergency mode, and set up an open company-wide call that anyone could join at any time.
T+2 min: I confirmed the incident was spreading as I expected and issued two directives: 1. freeze all code merges and commits across the board (mainly to keep newly created resources from rotating Pods and disrupting traffic); 2. have the operations team prepare a public announcement.
T+3 min: Following the SOP, I started restoring the databases in the SG region, along with their dependent services such as OpenSearch and Redis.
T+5 min: We began formally assessing the specific impact on upstream and downstream dependencies and confirmed that one newly launched core service was affected.
T+10 min: We published the downtime announcement and the impact notices for the other services.
T+10 min: I asked two colleagues to help set up a new ECR and clean up existing resources in the test environment, and briefed the CTO that in an extreme scenario we might decide to preserve the user experience at the cost of losing some data.
T+15 min: We confirmed that already-created resources and inbound traffic would not be significantly affected. The failover plan was put on hold, but we kept preparing the resources.
T+30 min: Our first database was restored.
T+40 min: Our second database was restored.
T+1 h: All related core infrastructure (RDS / ES / Redis) was standing by, configured with primary/replica setups and other options matching the production architecture. We also started bringing up services on the new cluster.
Fortunately, the AWS outage did not take down all of our services, so we never had to face the messy data reconciliation that follows a traffic cutover.
Around T+2 h to T+3 h, I formally announced to everyone that the emergency was over. To be safe, the feature freeze stayed in place for the night.
Looking back at the incident, there is more I could have done:
1. Share the extreme-case SOP I had prepared for myself with the whole team, so that someone can take over even if I am offline.
2. Run disaster drills ahead of time.
3. Issue directives more decisively.
That is about it. A small share that I hope helps.
US Government Uptime Monitor https://usa-status.com/
Daniel Lemire's blog
Speeding up C++ functions with a thread_local cache
In large code bases, we are often stuck with unpleasant designs that harm our performance. We might be looking for a non-intrusive way to improve performance; for example, you may not want to change the function signatures.
Let us consider a concrete example. Maybe someone designed the programming interface so that you have to access the values from a map using an index. They may have code like so:
```cpp
auto at_index(map_like auto& index_map, size_t idx) {
  size_t count = 0;
  for (const auto &[key, value] : index_map) {
    if (count == idx)
      return value;
    count++;
  }
  throw std::out_of_range("Index out of range");
}
```
This code goes through the keys of the map `idx` times. Typically, it implies some kind of linked-list traversal. If you are stuck with this interface, going through the values might imply repeated calls to the `at_index` function:
```cpp
for (size_t i = 0; i < input_size; ++i) {
  at_index(index_map, i);
}
```
If you took any kind of computer science class, you will immediately see the problem: my code has quadratic complexity. If you double the map size, you may quadruple the running time. It is likely fine if you have 2 or 4 elements in the map, but definitely not fine if you have 400 elements.
The proper solution is to avoid such a design. If you have direct access to the map, you can just iterate through it directly:
```cpp
for (auto& [key, value] : index_map) {
  sum += value;
}
```
But what if you are stuck? Then you can use a static or `thread_local` cache. The key insight is to keep your location in the map in a cache, and start from there on the next query. If the user is typically querying in sequence, then your cache should speed up the function tremendously.
```cpp
auto at_index_thread_local_cache(map_like auto& index_map, size_t idx) {
  using iterator = decltype(index_map.begin());
  // Per-thread cache: where the previous successful lookup ended up.
  struct Cache {
    iterator last_iterator;
    size_t last_index = size_t(-1);
  };
  thread_local Cache cache;
  // Fast path: the cache is primed and the caller asked for the next index.
  if (cache.last_index != size_t(-1) && idx == cache.last_index + 1 &&
      cache.last_iterator != index_map.end()) {
    cache.last_iterator++;
    cache.last_index = idx;
    if (cache.last_iterator != index_map.end()) {
      return cache.last_iterator->second;
    } else {
      throw std::out_of_range("Index out of range");
    }
  } else {
    // Slow path (first call or non-sequential access): linear scan from the
    // beginning, rebuilding the cache along the way.
    cache.last_iterator = index_map.begin();
    cache.last_index = size_t(-1);
    size_t count = 0;
    for (auto it = index_map.begin(); it != index_map.end(); ++it) {
      if (count == idx) {
        cache.last_iterator = it;
        cache.last_index = idx;
        return it->second;
      }
      count++;
    }
    throw std::out_of_range("Index out of range");
  }
}
```
In C++, a `thread_local` variable has exactly one instance per thread, shared by all function calls made within that thread. If you wish to have just one instance of the variable for the entire program, you can use `static` instead, but `thread_local` is the best choice in our case. You might be worried about the performance implications of a `thread_local` variable, but it is generally quite cheap: we only add a few instructions when accessing or modifying it.
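To make the per-thread semantics concrete, here is a minimal standalone sketch (mine, not from the post): the `thread_local` counter in `bump()` starts from zero independently in each thread, whereas a `static` counter would be a single instance shared by all threads.
```cpp
#include <iostream>
#include <thread>

int bump() {
  thread_local int per_thread = 0; // one independent instance per thread
  return ++per_thread;
}

int main() {
  std::thread t1([] { std::cout << "t1: " << bump() << ' ' << bump() << '\n'; }); // t1: 1 2
  t1.join();
  std::thread t2([] { std::cout << "t2: " << bump() << '\n'; }); // t2: 1, not 3
  t2.join();
}
```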
Our cache variable remembers the last accessed iterator and index per thread. If the next index is requested, we just increment the iterator and return. If the access is non-sequential, or it is the first call, the function falls back to a linear scan from the beginning, rebuilding the cache along the way.
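Call sites do not change. As a minimal usage sketch (mine, assuming the author's `map_like` concept accepts a plain `std::map`, as his benchmark suggests), a sequential loop pays for one linear scan on the first call and then takes the fast path, while an out-of-order query triggers a full rescan:
```cpp
#include <cstddef>
#include <map>

long long demo() {
  std::map<size_t, int> index_map = {{10, 1}, {20, 2}, {30, 3}};
  long long sum = 0;
  for (size_t i = 0; i < index_map.size(); ++i) {
    // Linear scan on the first call, then one iterator increment per call.
    sum += at_index_thread_local_cache(index_map, i);
  }
  // Out-of-sequence query: falls back to a full rescan, then re-primes the cache.
  int revisit = at_index_thread_local_cache(index_map, 1);
  return sum + revisit; // 6 + 2
}
```
One caveat worth keeping in mind: the cached iterator is not tied to a particular map object, so the scheme assumes a given thread keeps querying the same map between calls.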
The code is more complicated, and if you are not accessing the keys in sequence, it might be slower. However, the performance gains can be enormous. [I wrote a benchmark to test it out with maps containing 400 elements](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2025/10/19/).
| Method   | ns/key | instructions/key |
|----------|--------|------------------|
| original | 300    | 2000             |
| cache    | 2      | 17               |
In my case, the cache multiplied the performance by 150. Not bad.
source
Chips and Cheese
AMD’s Chiplet APU: An Overview of Strix Halo
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Chips and Cheese
Panther Lake’s Reveal at ITT 2025
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Chips and Cheese
Interviewing Intel's Chief Architect of x86 Cores at Intel Tech Tour 2025
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Facebook has released OpenZL, a compression framework that learns the structure of a file format to optimize compression ratio, compression speed, and decompression speed at the same time.
To use it, programmers write a description of the file's structure and generate/train a compressor specific to that format, so the file's internal structure can be exploited to produce data streams that compress more easily. All of these compressed streams share a single decompressor, which does not need to change when the compressor changes. When the input has no particular structure, the algorithm falls back to zstd.
https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
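As a rough illustration of the underlying idea (a hand-written sketch, not OpenZL's actual API): if you know the input is an array of fixed-size records, splitting it into one byte stream per field gives a generic backend such as zstd far more homogeneous data to work with, while the inverse transform stays the same regardless of how the split is tuned.
```cpp
#include <cstdint>
#include <vector>

// Hypothetical record layout; the point is that the programmer describes the
// structure and a format-aware transform is derived from that description.
struct Record {
  uint32_t id;
  uint16_t flags;
  int16_t  delta;
};

// Split an array of records into one byte stream per field ("struct of arrays").
// Each stream would then be handed to a generic compressor; the inverse
// transform is the same for every format, so the decompressor never changes.
std::vector<std::vector<uint8_t>> split_streams(const std::vector<Record>& records) {
  std::vector<std::vector<uint8_t>> streams(3);
  for (const Record& r : records) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&r.id);
    streams[0].insert(streams[0].end(), p, p + sizeof r.id);
    p = reinterpret_cast<const uint8_t*>(&r.flags);
    streams[1].insert(streams[1].end(), p, p + sizeof r.flags);
    p = reinterpret_cast<const uint8_t*>(&r.delta);
    streams[2].insert(streams[2].end(), p, p + sizeof r.delta);
  }
  return streams;
}
```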
Chips and Cheese
AMD’s EPYC 9355P: Inside a 32 Core Zen 5 Server Chip
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
Chips and Cheese
A Look into Intel Xeon 6’s Memory Subsystem
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
AMD's Inaugural Tech Day ft. ROCm 7, Modular, and AMD Lab Tour
#ChipAndCheese
Hello you fine Internet folks,
I was invited to AMD's Austin headquarters for their inaugural Tech Day, where AMD announced ROCm 7 and Modular showed off their results on the MI355X, all topped off by a tour of AMD's labs.
Hope y'all enjoy!
[Embedded video: www.youtube-nocookie.com]
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
source
(author: George Cozma)