Easton Man's Channel

Daniel Lemire's blog
Artificial Intelligence as the Expert’s Lever: Elevating Human Expertise in the Age of AI

The more likely outcome of the rise of generative artificial intelligence is higher value for the best experts… where ‘expert’ means ‘someone with experience solving real problems’.

“While one may worry that AI will simply render expertise redundant and experts superfluous, history and economic logic suggest otherwise. AI is a tool, like a calculator or a chainsaw, and tools generally aren’t substitutes for expertise but rather levers for its application.
By shortening the distance from intention to result, tools enable workers with proper training and judgment to accomplish tasks that were previously time-consuming, failure-prone or infeasible. Conversely, tools are useless at best — and hazardous at worst — to those lacking relevant training and experience. A pneumatic nail gun is an indispensable time-saver for a roofer and a looming impalement hazard for a home hobbyist.
For workers with foundational training and experience, AI can help to leverage expertise so they can do higher-value work. AI will certainly also automate existing work, rendering certain existing areas of expertise irrelevant. It will further instantiate new human capabilities, new goods and services that create demand for expertise we have yet to foresee.” (Autor, 2024)

source

14:57 · Jan 4, 2025 · Sat

Chips and Cheese
NeuReality: A Server on a PCIe Card
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Hello you fine Internet folks, I made a mistake with our last article, where I said that it would be our last bit of coverage on Supercomputing 2024. That was in fact incorrect; we have one more piece about Supercomputing 2024, which covers NeuReality. NeuReality…

ChipAndCheese

00:15 · Jan 3, 2025 · Fri

Daniel Lemire's blog
How does your URL parser handle Unicode?

Most strings today in software are Unicode strings. It means that you can include mathematical symbols, emojis and so forth. There are many different versions of the letter ‘M’, for example: the Roman letter M (U+004D) is semantically different from the Roman numeral Ⅿ (U+216F) while they both often have the same visual representation. John Cook has an interesting post on Unicode steganography: you can possibly use this ambiguity to hide messages in plain view. E.g., if you need to warn someone that you are in danger, you could send a text with the Roman numeral Ⅿ in place of the letter M. Normal people reading the text would not notice the difference.

What about URLs like Microsoft.com? What if you replace the Roman letter by a Roman numeral, is it still the same domain?

It is. To comply with the WHATWG URL specification, URL parsers are required to normalize URLs, which involves, among other things, replacing look-alike letters with Roman letters.
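To make the normalization step concrete, here is a minimal C++ sketch. It is a toy, not a compliant implementation: the helper name `fold_roman_numeral_m` and the constant `kRomanNumeralM` are made up for illustration, and it handles only the single code point U+216F, whereas a real parser applies a mapping table that covers many code points.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// UTF-8 encoding of U+216F (ROMAN NUMERAL ONE THOUSAND): three bytes,
// whereas the ASCII letter 'm' is the single byte 0x6D.
constexpr std::string_view kRomanNumeralM = "\xE2\x85\xAF";

// Hypothetical helper (an assumption for illustration): replaces every
// U+216F in a UTF-8 string with the ASCII letter 'm'.
std::string fold_roman_numeral_m(std::string_view input) {
    std::string out;
    out.reserve(input.size());
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (input.substr(pos, kRomanNumeralM.size()) == kRomanNumeralM) {
            out.push_back('m');            // mapped and lowercased
            pos += kRomanNumeralM.size();  // skip the three UTF-8 bytes
        } else {
            out.push_back(input[pos]);     // copy every other byte as-is
            ++pos;
        }
    }
    return out;
}
```

The two spellings of the domain differ at the byte level even though they look alike on screen, which is why a parser that skips this step reports a different host.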

But do they? Do the URL parsers actually do this hard work? Let us check.

Java. I could not get the standard Java library to return to me the host. It simply returns a null String.

 String url = "https://microsoft.coⅯ";
 URI uri = new URI(url);
 String host = uri.getHost();

C#. The .NET library seems to just return the domain as-is, with the Roman numeral.

string url = "https://microsoft.coⅯ";
Uri uri = new Uri(url);
string host = uri.Host;

PHP. The standard PHP interpreter just returns the domain as-is, with the Roman numeral.

$url = "https://microsoft.coⅯ";
$parsed_url = parse_url($url);
if ($parsed_url === false) {
 echo "URL could not be parsed.";
} else {
 $host = $parsed_url['host'];
}

Go. Go also does not do normalization.

urlString := "https://microsoft.coⅯ"
parsedURL, err := url.Parse(urlString)
if err != nil {
        fmt.Println("URL could not be parsed:", err)
        return
}
host := parsedURL.Host

Python. You guessed it: no normalization. It happily returns the Roman numeral.

import urllib.parse

url = "https://microsoft.coⅯ"
parsed_url = urllib.parse.urlparse(url)
host = parsed_url.netloc

JavaScript. JavaScript does it correctly. It will convert https://microsoft.coⅯ to https://microsoft.com.

const url = "https://microsoft.coⅯ";
const urlObj = new URL(url);
const host = urlObj.hostname;

C++. C++ does not have a standard URL parser, but if you use the ada URL parser, you will get correct results. If you are using the Node.js runtime environment, the underlying parser is the C++ ada URL parsing library.

auto url = ada::parse("https://microsoft.coⅯ");
if (!url) { /* failure */ }
std::string_view host = url->get_host();

source

13:35 · Jan 1, 2025 · Wed

Chips and Cheese
d-Matrix Corsair: 256GB of LPDDR for AI Models
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Hello you fine Internet folks, today is the last of our Supercomputing 2024 coverage, but last does not mean least. Today we are going to be covering a new AI startup company called d-Matrix. d-Matrix's first product is called the Corsair and it uses the Microscaling…

ChipAndCheese

11:15 · Jan 1, 2025 · Wed

░░░░░░░░░░░░░░░░░░░░ 0%

00:03 · Jan 1, 2025 · Wed

Happy New Year 2025, everyone 🥰

02:29 · Dec 30, 2024 · Mon

Daniel Lemire's blog
Efficient In-Place UTF-16 Unicode Correction with ARM NEON

source

15:51 · Dec 29, 2024 · Sun

杰哥的{运维，编程，调板子}小笔记
Apple M1 Microarchitecture Review

source

09:36 · Dec 27, 2024 · Fri

Chips and Cheese
IBM Power - What's Next?
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Hello you fine Internet folks, I had the opportunity to sit down and interview Bill Starke, the Chief Architect of Power CPUs at IBM, where we got to talk about the future of Power along with where he sees the industry going for future memory standards. But…

ChipAndCheese

05:39 · Dec 22, 2024 · Sun

Daniel Lemire's blog
Simpler and faster parsing code with std::views::split

source

07:40 · Dec 21, 2024 · Sat

Chips and Cheese
Skymont in Desktop Form: Atom Unleashed
#ChipAndCheese

Telegraph | source
(author: Chester Lam)

Skymont is Intel's newest E-Core architecture. E-Cores trace their lineage to low power and low performance Atom cores of long ago. But E-Cores have become an integral part of Intel's high performance desktop strategy, letting Intel maintain competitive multithreaded…

ChipAndCheese

11:05 · Dec 18, 2024 · Wed

BIRD 3.0.0 https://gitlab.nic.cz/labs/bird/-/blob/v3.0.0/NEWS?ref_type=tags

12:20 · Dec 17, 2024 · Tue

杰哥的{运维，编程，调板子}小笔记

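A Methodology for Reverse-Engineering CPU Microarchitectures

Background

I have recently written quite a few microarchitecture reviews, many of which involved reverse-engineering CPU microarchitectures:

● Qualcomm Oryon microarchitecture review
● AMD Zen 5 microarchitecture review
● ARM Neoverse V2 microarchitecture review

So here is a summary of the methodology for reverse-engineering CPU microarchitectures.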

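Definition

First, a definition: what is CPU microarchitecture reverse engineering? In my view it covers two things:

1. When a CPU microarchitecture is known to use a certain design but its design parameters are unknown, recovering those parameters through reverse engineering.
2. When it is unclear which design a CPU microarchitecture uses, proposing some candidate designs, ruling them out or confirming them through reverse engineering, and then going on to find the design parameters.

For example, suppose a CPU microarchitecture is known to have a set-associative L1 DCache, but its capacity and associativity are unknown. Reverse engineering can recover the capacity and the number of ways, and may go further and recover the index function as well. This is the first meaning.

As another example, suppose a CPU microarchitecture is known to have a branch predictor, but it is unknown which information the predictor uses: perhaps the branch address, perhaps the branch target address, perhaps the branch direction. Reverse engineering eliminates the possibilities one by one to find the real one. If elimination cannot narrow things down to a single candidate, or if every candidate is eliminated, the actual microarchitecture design does not match what was expected.

For the first meaning, there are already many mature microbenchmarks (a benchmark designed to probe the microarchitecture is called a microbenchmark). They target common microarchitecture designs, implement the reverse engineering of the corresponding design parameters, and can be used directly on many platforms. For the second meaning, each case still has to be analyzed individually: guess the underlying design, then construct a microbenchmark for that design.

The rest of this post covers the methodology for designing and implementing microbenchmarks.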

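Principle

First, the principle behind a microbenchmark. The core idea is to construct a program that makes some microarchitectural component the bottleneck, sweep along the dimension of the design parameter being reverse-engineered, and use some metric to show whether the bottleneck has appeared. The point in the sweep at which the bottleneck appears gives the value of the design parameter. That is a bit hard to follow, so here is an example.

Suppose the goal is to measure the capacity of the L1 DCache. The L1 DCache capacity then has to become the bottleneck. To make that happen, the program repeatedly accesses a region of memory larger than the L1 DCache, so that the L1 DCache cannot hold all of it and cache misses occur. Whether misses occur can be judged from time or cycle counts, since cache misses always cost performance, or directly from the cache-miss performance counter. Because the design parameter being reverse-engineered is the L1 DCache capacity, the sweep runs over capacity: allocate arrays of different sizes in memory, say one of 32KB and another of 64KB, and access only one of them in each test. Scan each array a number of times, then record the total time, cycle count or cache-miss count. If the actual L1 DCache capacity lies between 32KB and 64KB, the 64KB array should show a clear performance drop compared with the 32KB one. With a finer granularity, one array size per 1KB, the sweep eventually pins down the actual L1 DCache capacity.

In this example, the component that becomes the bottleneck is the L1 DCache, the design parameter being reverse-engineered is its capacity, the metric that reveals the bottleneck is performance or the cache-miss count, and the constructed program repeatedly accesses an array of variable size, where the array size is tied to the design parameter being reverse-engineered.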
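The capacity sweep described above can be sketched in C++ as follows. This is only an outline of the structure, under stated assumptions: the function names `ns_per_access` and `sweep_sizes` are hypothetical, the cache-line size is assumed to be 64 bytes, and time is used as the metric. A simple strided scan like this one can be partly hidden by hardware prefetchers, so a real measurement would usually need a less predictable access pattern and, as the pitfalls later in this post note, hand-written assembly.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kLineBytes = 64;

// Average time in nanoseconds per access when a buffer of `bytes` bytes is
// scanned `passes` times, touching one byte per assumed cache line.
double ns_per_access(std::size_t bytes, int passes) {
    std::vector<std::uint8_t> buf(bytes, 1);
    volatile std::uint8_t* data = buf.data();  // volatile keeps every load
    std::uint64_t sum = 0;
    std::size_t accesses = 0;
    auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < bytes; i += kLineBytes) {
            sum += data[i];
            ++accesses;
        }
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sum;
    if (accesses == 0) return 0.0;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(accesses);
}

// Array sizes to test: from `lo` to `hi` inclusive, in steps of `step` bytes.
std::vector<std::size_t> sweep_sizes(std::size_t lo, std::size_t hi,
                                     std::size_t step) {
    assert(step > 0);
    std::vector<std::size_t> sizes;
    for (std::size_t s = lo; s <= hi; s += step) sizes.push_back(s);
    return sizes;
}
```

Calling `ns_per_access` for each value returned by `sweep_sizes(16 * 1024, 128 * 1024, 1024)` and looking for the size at which the per-access time jumps is the sweep in its simplest form.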

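From this, the elements of microbenchmark design can be summarized:

1. Which microarchitectural component is targeted
2. Which design parameter of that component is targeted
3. Which metric reveals the bottleneck
4. How to construct a program that causes the bottleneck
5. Under what conditions the program causes the bottleneck
6. How the program's parameters map onto the design parameter

For the L1 DCache capacity test above, the answers are:

1. Which microarchitectural component is targeted: the L1 DCache
2. Which design parameter of that component is targeted: the capacity of the L1 DCache
3. Which metric reveals the bottleneck: time, cycle count, cache-miss count
4. How to construct a program that causes the bottleneck: allocate an array in memory, then scan it repeatedly
5. Under what conditions the program causes the bottleneck: the array size exceeds the L1 DCache capacity
6. How the program's parameters map onto the design parameter: the array size corresponds to the L1 DCache capacity

To design a test for the capacity of the ROB (ReOrder Buffer), think through the same elements:

1. Which microarchitectural component is targeted: the ROB
2. Which design parameter of that component is targeted: how many instructions the ROB can hold
3. Which metric reveals the bottleneck: time, cycle count
4. How to construct a program that causes the bottleneck: place one long-latency instruction at the head of the ROB and one at the tail, with a number of filler instructions in between
5. Under what conditions the program causes the bottleneck: if enough filler instructions are inserted that the trailing long-latency instruction cannot enter the ROB, it cannot be executed speculatively, so its latency no longer overlaps with that of the first one
6. How the program's parameters map onto the design parameter: the number of instructions in the ROB at the point where the trailing long-latency instruction is kept out of the ROB

Once these elements are thought through, it becomes clear how to design a microbenchmark.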

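With the principle covered, here are some commonly used methods.

Obtaining the metric

As mentioned above, revealing a bottleneck requires a metric. Ideally it reflects precisely whether the bottleneck occurred, with as little noise as possible. There are not many metrics to choose from, only two kinds:

1. Time: the most general option, usable on every platform. Record the time before and after the program and take the difference.
2. Performance counters: more cumbersome to use. They sometimes require root privileges, or the relevant hardware information is not public, or the hardware simply does not implement the counter in question. Availability of performance counters by platform:
   1. Windows: available, with a ready-made API
   2. macOS: available, through a reverse-engineered private framework API
   3. Linux: available, with a ready-made API
   4. iOS: currently usable only through Xcode, which is inconvenient
   5. Android: requires root or use through adb shell, which is rather cumbersome
   6. HarmonyOS NEXT: not available

Although measuring time is the simplest and most general approach, it is limited by frequency fluctuations. If the frequency changes sharply while the test runs (especially on phones), it introduces a lot of noise, and the useful signal gets drowned out.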
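A minimal C++ sketch of the time metric follows: read a clock before and after the work and take the difference. Repeating the measurement and keeping the minimum is an addition of this sketch, not something the post prescribes; it is one common way to reduce the effect of noisy runs, and it does not remove the effect of frequency changes. The name `min_elapsed_ns` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` `repeats` times and returns the smallest elapsed time in
// nanoseconds. Each run is timed by reading the clock before and after.
template <typename Work>
double min_elapsed_ns(Work&& work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::nano> elapsed = stop - start;
        best = std::min(best, elapsed.count());  // keep the least noisy run
    }
    return best;
}
```

`std::chrono::steady_clock` is used because it never jumps backwards; converting the result to cycles would additionally require knowing the core frequency during the run, which is exactly what fluctuates.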

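Of the two, performance counters are the most precise. Although they are more cumbersome to use, they have made much of the deeper CPU microarchitecture reverse engineering possible. I hope hardware vendors who read this post will not hide performance counters in order to prevent reverse engineering, because they are really useful for application performance analysis. For how to use performance counters in practice, see some of the existing microbenchmark frameworks.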

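Common recipes

Here are some common recipes for constructing a bottleneck:

1. Measuring capacity (for example each level of I/D cache and the TLBs): construct a program that fills up the capacity. Once the capacity is used up, a performance drop can be observed.
2. Measuring the depth of a microarchitectural queue or buffer (for example the ROB, register files, scheduler queues): use an instruction at the head of the queue to block dequeuing, then keep enqueuing new instructions. When the queue is full, no further instructions can be enqueued. At that point, introduce some instructions that would not otherwise be blocked; because the queue is stuck they cannot enter, and performance drops.
3. Testing set-associative structures (for example the BTB and caches): in a set-associative structure the capacity of each index is fixed, so measuring capacity shows how many indexes are covered. If changing the input of the index function (for example the PC) makes some indexes unreachable, a reduction in capacity can be observed, and the measured capacity in turn tells how many indexes are still reachable.
4. Constructing pointer chasing: at a granularity of 8B (the size of a 64-bit pointer), a cache line or a page, shuffle the elements randomly and then link them with pointers, so that the memory each pointer points to holds the address of the next pointer.
5. Constructing long-latency instructions: commonly used when testing instruction queues. This can usually be done with a pointer-chasing long-latency load, or with a chain of serially dependent floating-point divide or square-root instructions.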
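The pointer-chasing recipe above can be sketched in C++ as follows. This is an illustration under assumptions, not the post's own code: indexes into a vector stand in for raw pointers (8 bytes each on a typical 64-bit platform), the names `build_chain` and `chase` are hypothetical, and Sattolo's algorithm is the shuffle chosen here because it guarantees a single cycle that visits every slot.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds one random cycle over n slots using Sattolo's algorithm:
// next[i] is the slot to visit after slot i, and following the links
// from any slot visits all n slots before returning to the start.
std::vector<std::size_t> build_chain(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Swap with a strictly earlier slot; this is what rules out
        // short cycles that would leave part of the array untouched.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follows the chain for `steps` links. Each load depends on the result of
// the previous one, so the loads cannot overlap.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t start,
                  std::size_t steps) {
    std::size_t cur = start;
    for (std::size_t s = 0; s < steps; ++s) cur = next[cur];
    return cur;
}
```

To chase at cache-line or page granularity instead, each slot would be padded to that size; the construction of the cycle stays the same. Returning the final index from `chase` keeps the compiler from discarding the loop.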

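And some common pitfalls:

1. Write test cases in assembly as far as possible; C/C++ compilers may introduce unwanted behavior.
2. Some linker behaviors may need to be avoided; for example, the linker may modify some instructions.
3. The linker may also have limitations; for example, it may not support very large alignments.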

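Existing microbenchmarks

In fact, there are already many ready-made microbenchmarks, as well as documents that describe microbenchmarks:

● https://www.agner.org/optimize/
● https://github.com/clamchowder/Microbenchmarks/
● https://github.com/JamesAslan/MicroArchBench
● https://github.com/name99-org/AArch64-Explore
● https://github.com/jiegec/cpu-micro-benchmarks

And some sites that publish reverse-engineering results obtained with microbenchmarks:

● https://chipsandcheese.com
● Anandtech (sadly no longer updated)
● https://blog.hjc.im/
● https://www.zhihu.com/people/jamesaslan
● This blog

If you want to reverse-engineer some component of a microarchitecture but do not know how, try looking through the sites above to see whether an implementation already exists.

If you are not interested in how to write these microbenchmarks, you can also try running these programs on your own computer, or simply read the existing analyses.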

source

11:11 · Dec 17, 2024 · Tue

Chips and Cheese
Rebellions: From High Frequency Trading to AI Acceleration
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Hello you fine Internet folks, at Supercomputing 2024 we stopped by the Rebellions booth. Rebellions is a Korean startup that originally focused on the High Frequency Trading (HFT) sector and now is transitioning to the AI sector with their second and third…

ChipAndCheese

15:47 · Dec 16, 2024 · Mon

#名言

The best kind of security is the lack of marketshare

名言

08:57 · Dec 16, 2024 · Mon

Daniel Lemire's blog
Accessing the attributes of a struct in C++ as array elements?

In C++, it might be reasonable to represent a URL using a class or a struct made of several strings, like so:

struct basic {
    std::string protocol;
    std::string username;
    std::string password;
    std::string hostname;
    std::string port;
    std::string pathname;
    std::string search;
    std::string hash;
};

You might associate to each component (protocol, username, etc.) an index, like so:

enum class component {
     PROTOCOL = 0,
     USERNAME = 1,
     PASSWORD = 2,
     HOSTNAME = 3,
     PORT = 4,
     PATHNAME  = 5,
     SEARCH = 6,
     HASH = 7,
};

What you might like to do then is to access a component by its index. The following code might do:

std::string& get_component(basic& url, component comp) {
    switch (comp) {
        case component::PROTOCOL: return url.protocol;
        case component::USERNAME: return url.username;
        case component::PASSWORD: return url.password;
        case component::HOSTNAME: return url.hostname;
        case component::PORT:     return url.port;
        case component::PATHNAME: return url.pathname;
        case component::SEARCH:   return url.search;
        case component::HASH:     return url.hash;
    }
}

But what if you are constantly accessing values by their indexes? You might be concerned that the overhead of the switch/case could be too much.

Instead, you might flip the data structure around and store the values in an array within the data structure. The following might work:

struct fat {
    std::array<std::string, 8> data;  // eight components, indexes 0 through 7
    std::string &protocol = data[0];
    std::string &username = data[1];
    std::string &password = data[2];
    std::string &hostname  = data[3];
    std::string &port  = data[4];
    std::string &pathname = data[5];
    std::string &search = data[6];
    std::string &hash = data[7];
};

With this new data structure, getting a component by its index becomes simpler:

std::string& get_component(fat& url, component comp) {
    return url.data[int(comp)];
}

Unfortunately, each reference in the new fat data structure might use 8 bytes. That is not a concern if you expect to have few instances of the data structure. However, if you expect many instances, you might want to avoid the references. You might try to replace the references by simple methods:

struct advanced {
    std::array<std::string, 8> data;  // eight components, indexes 0 through 7
    std::string &protocol() { return data[0]; }
    std::string &username() { return data[1]; }
    std::string &password() { return data[2]; }
    std::string &hostname() { return data[3]; }
    std::string &port() { return data[4]; }
    std::string &pathname() { return data[5]; }
    std::string &search() { return data[6]; }
    std::string &hash() { return data[7]; }
};

It is not entirely satisfactory as it requires calling methods instead of accessing attributes.

I am not sure whether you can do any better currently in C++.
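One possibility, offered as a sketch rather than a definitive answer, is to keep the plain struct and look components up through a shared table of pointers to members. The table name `kMembers` is an assumption made for illustration, and whether this beats the switch/case in practice would need to be measured, since compilers may already turn the switch into something similar.

```cpp
#include <array>
#include <cstddef>
#include <string>

struct basic {
    std::string protocol;
    std::string username;
    std::string password;
    std::string hostname;
    std::string port;
    std::string pathname;
    std::string search;
    std::string hash;
};

enum class component {
    PROTOCOL = 0, USERNAME = 1, PASSWORD = 2, HOSTNAME = 3,
    PORT = 4, PATHNAME = 5, SEARCH = 6, HASH = 7,
};

// One table shared by every instance, so objects carry no extra bytes.
// The order must match the enum values above.
constexpr std::array<std::string basic::*, 8> kMembers = {
    &basic::protocol, &basic::username, &basic::password, &basic::hostname,
    &basic::port,     &basic::pathname, &basic::search,   &basic::hash,
};

// Index-based access; url.hostname and friends remain ordinary attributes.
std::string& get_component(basic& url, component comp) {
    return url.*kMembers[static_cast<std::size_t>(comp)];
}
```

With this layout `url.hostname` and `get_component(url, component::HOSTNAME)` refer to the same string, and the struct stays the size of its eight strings. The cost is that the table and the enum have to be kept in sync by hand.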

source
