Chips and Cheese
AMD's RDNA4 Architecture (Video)
#ChipAndCheese
Telegraph | source
(author: George Cozma)
Chips and Cheese
Zen 5's AVX-512 Frequency Behavior
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
If you want to become moe, you first have to study MoE
vllm can also be llvm
Arch Linux: Recent news updates
Cleaning up old repositories
Around two years ago, we've merged the [community] repository into [extra] as part of the git migration. In order to not break user setups, we kept these repositories around in an unused and empty state. We're going to clean up these old repositories on 2025-03-01.
On systems where /etc/pacman.conf still references the old [community] repository, pacman -Sy will return an error on trying to sync repository metadata.
The following deprecated repositories will be removed: [community], [community-testing], [testing], [testing-debug], [staging], [staging-debug].
Please make sure to remove all use of the aforementioned repositories from your /etc/pacman.conf (for which a .pacnew was shipped with pacman>=6.0.2-7)!
source
(author: Sven-Hendrik Haase)
Daniel Lemire's blog
AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors
The recent AMD processors (Zen 4) provide extensive support for the powerful AVX-512 instructions. AVX-512 (Advanced Vector Extensions 512) is an extension to the x86 instruction set architecture (ISA) introduced by Intel. These instructions enhance the capabilities of processors by allowing for more data to be processed in parallel. You can process registers made of 64 bytes!
One neat trick is that, given a mask, you can ‘compress’ words. Suppose that you have a vector made of thirty-two 16-bit words and you want to keep only the second and third ones: you can use the vpcompressw instruction with the mask 0b110. It will produce a register where the second and third words are placed in the first and second positions.
An even nicer trick is that you can use this instruction to write just these two words out to memory. For a 512-bit register, you can invoke this functionality with the _mm512_mask_compressstoreu_epi16 function intrinsic.
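To make the semantics concrete, here is a minimal scalar sketch of what a masked compress-store does on thirty-two 16-bit words. It does not use AVX-512 at all, so it runs on any machine; the function name compress_store_model is hypothetical and only models the behavior described above, not its performance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of a masked compress-store over thirty-two 16-bit words.
// For every set bit i in `mask`, word i of `in` is appended to `out`,
// so the selected words end up contiguous at the start of `out`.
// Returns the number of words written.
// `compress_store_model` is a hypothetical name used for illustration.
std::size_t compress_store_model(const std::uint16_t* in,
                                 std::uint32_t mask,
                                 std::uint16_t* out) {
    assert(in != nullptr && out != nullptr);
    std::size_t written = 0;
    for (std::size_t i = 0; i < 32; ++i) {
        if ((mask >> i) & 1u) {
            out[written++] = in[i];
        }
    }
    return written;
}
```

With the mask 0b110 (bits 1 and 2 set), only the second and third input words are written, and they land in the first two output positions.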
This works well on recent Intel processors, but not so well on AMD Zen 4 processors.
We have a fast function in the simdjson library to minify a file (remove unnecessary spaces).
https://github.com/simdjson/simdjson/pull/2335
source
Social Stockfish
Like a chess analysis engine, it predicts the next 5 exchanges with the person you are talking to, and tells you the best reply right now.
https://fixvx.com/eddybuild/status/1889908182501433669
This is an interesting wild idea
GNU Gold Linker Is Deprecated & Will Be Gone For Good Without New Developers
https://www.phoronix.com/news/GNU-Gold-Linker-Deprecated
Chips and Cheese
Intel’s Battlemage Architecture
#ChipAndCheese
Telegraph | source
(author: Chester Lam)
Daniel Lemire's blog
Thread-safe memory copy
A common operation in software is the copy of a block of memory. In C/C++, we often call the function memcpy for this purpose.
But what happens if, while you are copying the data, another thread is modifying either the source or the destination? The result is fundamentally unpredictable and almost surely a programming error.
Why would you ever code a copy function in such a way given that it is an error? Suppose you are implementing a JavaScript engine in C++, like Google's V8. In JavaScript, we have SharedArrayBuffer instances that can be modified and copied from different threads. As the engineer working on the JavaScript engine, you cannot always prevent users from writing buggy code.
In any case, you get a data race: two or more threads access the same memory location simultaneously, where at least one of the accesses is a write operation, without a synchronization mechanism to ensure that these operations occur in a specific order.
What happens? The C++ standard states that a data race results in undefined behavior. In effect, the C++ language does not tell you what happens. A crash might occur. Of course, the JavaScript engineer would rather not see a crash.
Importantly, ‘undefined behavior’ also does not tell you that there is necessarily an error. Effectively, it tells you that, as the programmer, you acquire the additional responsibility of ensuring that the code is safe. There is no warranty coming from the programming language itself.
Why do languages like C and C++ leave undefined behavior?
A good analogy is an organization with many sub-components, where new sub-components could be added at any time. Think of an interstellar federation of planets. The interstellar federation can specify overall laws that are well defined, but there will be remaining corner cases that are specific to which planet you reside in.
That’s the spirit of C and C++: these programming languages can target a very wide range of platforms. For some of these platforms, a data race is without consequence… for others, it could be highly problematic or just slow. Also, by not specifying the behavior, it allows the compiler designer some options. So the programming language leaves it up to you to check.
Consider a conflicting memory copy where you, for example, copy from array A to array B while another thread copies from array B to array A. On most platforms, this will not cause a crash or anything especially dangerous. You might get garbage data in your arrays, in the worst case.
But if you use automated sanitizer tools, you may still get a warning regarding the data race, even when it is inconsequential. You can silence the warning by telling the tools that you have checked that the copy is safe.
Instead, you could roll your own ‘safe’ memory copy, where you load the content byte by byte (for example) in an atomic fashion. A possible solution in C++20 looks like so:
#include <atomic>
#include <cstddef>

// Copy one byte at a time using relaxed atomic loads and stores.
void safe_memcpy(char *dest, const char *src, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    char input =
        std::atomic_ref<const char>(src[i])
            .load(std::memory_order_relaxed);
    std::atomic_ref<char>(dest[i])
        .store(input, std::memory_order_relaxed);
  }
}

We have now done away with any kind of undefined behavior. The code ought to be perfectly ‘safe’: there is no more data race.
So why not always use this safe approach?
Because it can be 40 times slower than a conventional memory copy.
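One way to check this kind of claim on your own machine is a rough timing sketch like the one below. It is not the author's benchmark. To stay compilable before C++20 it swaps std::atomic_ref for the GCC/Clang __atomic builtins (an assumption about your compiler), and the names atomic_byte_copy, seconds_per_copy and slowdown_ratio are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Byte-by-byte copy with relaxed atomic accesses.
// Assumption: GCC/Clang __atomic builtins stand in for std::atomic_ref.
void atomic_byte_copy(char* dest, const char* src, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        char c = __atomic_load_n(&src[i], __ATOMIC_RELAXED);
        __atomic_store_n(&dest[i], c, __ATOMIC_RELAXED);
    }
}

// Average wall-clock seconds for one call of `copy` over `bytes` bytes.
template <typename Copy>
double seconds_per_copy(Copy copy, std::size_t bytes, int repeats) {
    assert(repeats > 0);
    std::vector<char> src(bytes, 'x');
    std::vector<char> dest(bytes, '\0');
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        copy(dest.data(), src.data(), bytes);
    }
    const auto stop = std::chrono::steady_clock::now();
    assert(dest == src);  // the copy must actually have happened
    return std::chrono::duration<double>(stop - start).count() / repeats;
}

// Ratio of atomic-copy time to memcpy time; larger means slower.
double slowdown_ratio(std::size_t bytes, int repeats) {
    const double fast = seconds_per_copy(
        [](char* d, const char* s, std::size_t n) { std::memcpy(d, s, n); },
        bytes, repeats);
    const double slow = seconds_per_copy(atomic_byte_copy, bytes, repeats);
    return slow / fast;
}
```

Calling slowdown_ratio(1 << 20, 100) from your own main gives a ballpark figure. The number depends heavily on the compiler, optimization flags and buffer size, and an optimizing compiler may elide repeated memcpy calls, so a careful measurement needs a proper benchmarking harness.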
It becomes an engineering question. Sometimes performance really does not matter.
In programming, there is practically never a free lunch. It is common that you have to take your pick: aim for high performance but acquire more responsibilities, or sacrifice performance for the sake of having fewer worries.
source