Post #996

@TheB1ackParade

Welcome to the Black Parade

Views: 497

Posted: Jan 10, 2026, 10:10 AM

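Lately I've been leaning heavily on socket cookies to build a model. While thinking about lifetime (is an sk cookie ever reused after the sk is released?) and value distribution (if the values are local enough, a robin hood map might be an easy performance win), I took a look at the kernel's cookie generation code. It's quite interesting 🤔

    static __always_inline u64 gen_cookie_next(struct gen_cookie *gc)
    {
        struct pcpu_gen_cookie *local = this_cpu_ptr(gc->local);
        u64 val;

        if (likely(local_inc_return(&local->nesting) == 1)) {
            val = local->last;
            if (__is_defined(CONFIG_SMP) &&
                unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
                s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                               &gc->forward_last);
                val = next - COOKIE_LOCAL_BATCH;
            }
            local->last = ++val;
        } else {
            val = atomic64_dec_return(&gc->reverse_last);
        }
        local_dec(&local->nesting);
        return val;
    }

The core idea is a two-level fast path: an outer likely(nesting == 1) check and an inner unlikely((val & 0xfff) == 0) check. Each CPU is handed COOKIE_LOCAL_BATCH (4096) per-CPU local IDs at a time. In the non-reentrant case it simply does ++val on the local ID and returns that as the cookie: zero locks, zero atomic operations on the shared counter, a performance monster. Only after all 4096 cookies are used up does it take the slow path once, claiming a fresh batch of 4096 slots with a single atomic instruction. In the reentrant case it takes a separate path that returns straight from one atomic instruction (the decrement of reverse_last).

(Warning: the diagram's indentation will be a mess on mobile 🔞)

       cpu0                    cpu0
     (used up)          (val=4096*2+2048)
         │                       │
         ▼                       ▼
    ┌─────────┬─────────┬─────────┐
    │  4096   │  4096   │  4096   │
    └─────────┴─────────┴─────────┘
                   ▲
                   │
                 cpu2
            (val=4096+100)

With this idea we can design an extremely fast thread-safe ID generator. In user space it is even simpler, because we can use TLS (thread-local storage), or simpler still, an explicit thread-local generator. In Go, for example, it can be written like this:

    import "sync/atomic"

    // Batch is the number of IDs a Local claims per trip to the shared counter.
    // It must be a power of two for the mask check in NextID to work.
    const Batch = 4096

    // Global holds the shared counter that hands out batches.
    type Global struct {
        forward_last atomic.Uint64
    }

    // Local is a per-goroutine generator; it must not be shared between goroutines.
    type Local struct {
        last         uint64
        _            [64 - 8]byte // Cacheline
        forward_last *atomic.Uint64
    }

    func NewGlobal() *Global { return &Global{} }

    func (g *Global) GetLocal() *Local {
        return &Local{forward_last: &g.forward_last}
    }

    func (l *Local) NextID() uint64 {
        val := l.last
        if (val & (Batch - 1)) == 0 {
            // Slow path: the current batch is exhausted (or never claimed),
            // so atomically claim the next Batch IDs.
            base := l.forward_last.Add(Batch) - Batch
            val = base
        }
        // Fast path: a plain local increment.
        val++
        l.last = val
        return val
    }

In the right scenario, such as "a modest number of goroutines, each generating globally unique IDs at a very high rate" (which is really the same shape as the kernel's per-CPU pointer), this algorithm crushes atomic.Uint64 and sync.Mutex:

    $ taskset -c 1,2,4 ./go_nextid.test -test.bench=.
    cpu: Intel(R) Core(TM) Ultra 7 155U
    BenchmarkNextID_Local-3                 1000000000    0.2906 ns/op
    BenchmarkNextID_Local_Fixed/G=16-3      1000000000    0.1252 ns/op
    BenchmarkNextID_Local_Fixed/G=128-3     1000000000    0.1352 ns/op
    BenchmarkNextID_Local_Fixed/G=1024-3    1000000000    0.1436 ns/op
    BenchmarkAtomic_Add-3                    316950108    3.804 ns/op
    BenchmarkAtomic_Add_Fixed/G=16-3          56497480    21.83 ns/op
    BenchmarkAtomic_Add_Fixed/G=128-3         54910852    21.90 ns/op
    BenchmarkAtomic_Add_Fixed/G=1024-3        55797655    22.03 ns/op
    BenchmarkMutex-3                         100000000    10.78 ns/op
    BenchmarkMutex_Fixed/G=16-3               15218778    85.92 ns/op
    BenchmarkMutex_Fixed/G=128-3              12138136    103.6 ns/op
    BenchmarkMutex_Fixed/G=1024-3             10970743    106.1 ns/op

So it's only a measly hundred times faster or so. (runs away)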
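For anyone who wants to sanity-check the uniqueness claim, here is a minimal sketch. The generator is copied from the Go code in the post; `countUnique` and `main` are hypothetical helpers added only for this check, and the sketch assumes each goroutine owns its own `Local` (a `Local` shared between goroutines would race on `last`). It assumes Go 1.19 or newer for `atomic.Uint64`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const Batch = 4096 // power of two, so val&(Batch-1) detects a batch boundary

type Global struct {
	forward_last atomic.Uint64
}

type Local struct {
	last         uint64
	_            [64 - 8]byte
	forward_last *atomic.Uint64
}

func NewGlobal() *Global { return &Global{} }

func (g *Global) GetLocal() *Local {
	return &Local{forward_last: &g.forward_last}
}

func (l *Local) NextID() uint64 {
	val := l.last
	if val&(Batch-1) == 0 {
		// Batch exhausted (or never claimed): claim the next one atomically.
		val = l.forward_last.Add(Batch) - Batch
	}
	val++
	l.last = val
	return val
}

// countUnique (hypothetical helper) starts `workers` goroutines, gives each
// its own Local, draws perWorker IDs from each, and returns how many
// distinct IDs were produced in total.
func countUnique(workers, perWorker int) int {
	g := NewGlobal()
	ids := make([][]uint64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			l := g.GetLocal() // one generator per goroutine
			out := make([]uint64, perWorker)
			for i := range out {
				out[i] = l.NextID()
			}
			ids[w] = out
		}(w)
	}
	wg.Wait()

	seen := make(map[uint64]struct{}, workers*perWorker)
	for _, out := range ids {
		for _, id := range out {
			seen[id] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	const workers, perWorker = 8, 3*Batch + 7
	// Every ID should be distinct, so the count should equal workers*perWorker.
	fmt.Println(countUnique(workers, perWorker) == workers*perWorker) // prints: true
}
```

Note that the IDs are unique but not dense or ordered across goroutines: each `Local` that stops mid-batch leaves the rest of its batch unused, which is the price of keeping the fast path free of shared atomics.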