V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
V2EX  ›  maypok86  ›  全部回复第 1 页 / 共 1 页
回复总数  8
1.) Yes, users of cache libraries usually want a map with limited size, ttl and maybe some other set of features and don't want to care about preemptive policies within that library. But if this library doesn't show a good hit ratio, then the rest of its features are of no use at all, because cache misses will waste a huge amount of time on the system. And lru sharding and ignoring get operations in the eviction policy looks like a very small hit ratio (many times smaller than the others). Also as I wrote before, it's not very clear how lru and ttl coexist together in the same heap.

2.) Yes, measuring hit ratio is not a very nice thing to do. But if you don't want to write these checks yourself, you can check them on ristretto, theine or otter benchmarks. All these benchmarks have approximately the same values.

3.) Yes, I once got such strange memorycache benchmarks on otter and it was on pure get when benchcount was very large. I'll try to look at this today tentatively looks like a gc loop, or some bug when accounting for expiration. It would be cool if you could share the full benchmark code so I can profile easier.

4.) I've noticed here that they unironically take off some balance for posts, so let's continue in this issue https://github.com/lxzan/memorycache/issues/13
Anyway, I fiddled around and tested benchmarks of mine and memorycache and the verdict is pretty interesting. Here I won't take trace types into account, because on pure set and get they really don't affect much, but only on mixed operations.

1.) There is a bug in memorycache benchmarks: https://github.com/lxzan/memorycache/blob/main/benchmark/benchmark_test.go#L31 this add eats about half of the benchmark time, since a bunch of goroutines compete for this atomic, it is better to use a local counter per goroutine or fastrand. And on such benchmarks the result is about the same:
goos: darwin
goarch: arm64
pkg: github.com/lxzan/memorycache/benchmark
BenchmarkMemoryCache_Set-8 27370962 83.20 ns/op
BenchmarkMemoryCache_Get-8 62720229 18.51 ns/op
BenchmarkRistretto_Set-8 19263478 101.4 ns/op
BenchmarkRistretto_Get-8 39636226 26.51 ns/op
BenchmarkOtter_Set-8 26425458 39.04 ns/op
BenchmarkOtter_Get-8 40814108 28.50 ns/op
PASS
ok github.com/lxzan/memorycache/benchmark 22.620s

Also on my benchmarks I got proportional results, memorycache was about a third faster than otter on pure get and about twice as slow on pure set, but things got sad when mixed load was applied
BenchmarkCache/Zipf_Memory_reads=75%,writes=25%-8 4453213 289.6 ns/op 13 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 13881976 75.53 ns/op 6 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 2853026 469.0 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 3112142 390.5 ns/op 46 B/op 1 allocs/op

2.) I really wondered how regular locking with map was able to defeat wait-free ristretto and otter with rare request blocking. I went to the memorycache code and saw this https://github.com/lxzan/memorycache/blob/main/cache.go#L124 that is memorycache in no way accounts for reads of non-expired items in the eviction policy. This is not LRU at all and I don't know what it is because any eviction policy has to take into account both reads and writes. Like for example here https://github.com/hashicorp/golang-lru/blob/main/simplelru/lru.go#L72. I'll try to check the hit ratio tomorrow, but I suspect it will be very small

I'd like some kind of answer to this, to be honest, because I've been told so much about how only memorycache can fight ristretto, but so far it's been disappointing.
@Nazz Ok, maybe I'm misunderstanding something then I'd like to understand how it works in general.

1.) Suppose you use heap for the list in lru. Then why do it at all for O(logn) when you can do it for O(1)? And then how does deleting expirated items even work if the heap is not ordered by expiry time? It seems to me there is no way and memorycache leaves a large number of items alive until it reaches them in lru.

2.) Cutting lru into shards looks questionable in terms of hit ratio because one of the shards may simply have more items than the others and will constantly evict good items.

3.) Yes, redis can store id's but that's not what I was writing about. In a trace of random strings, all elements occur only once. And querying each of them is a cache miss but in reality there is a set of most frequent elements, which is what we want to store. And in memorycache benchmarks, any new item causes the caches to constantly evict items which would not happen in reality. It would also be cool to see benchmarks with a combination of set and get operations. And hit ratio comparison too.
@Nazz lol, the heap isn't there for what you're using. The heap is there just to avoid going through all the values in lfu to find the most rare element. You're using the heap for ttl (no question about that) and as an eviction policy you're just removing the element with the lowest expiry time, which is terrible. This gives you the speed to beat ristretto (since ristretto has to completely lock the eviction policy for all shards) but the hit ratio of such a cache will be horrible, because such actions are not really different from removing the first inserted item
@Nazz

1.) any cache library is capable of replacing redis in light-use scenarios and it doesn't give MC any advantage :)

2.) The claim that MC is a rival to Ristretto at all is very funny because MC simply doesn't have an eviction policy (MC just removes the element with the smallest expiration time, which isn't even LRU as written in the README). And in fact the main problem that RTOs solve. As a result we have that MC loses to RTO by hit ratio, loses by performance on real traces Otter and has serious pressure on GC. It seems that a cache library with no eviction policy should offer something in return (like no pressure on GC like bigcache https://github.com/allegro/bigcache). In the end we have that MC is slightly faster than Ristretto but still has a nasty hit ratio because of which the system will spend much more time. And if we compare it to Otter, it will lose on all parameters.

3.) Do you really think that Dgraph labs, PostgreSQL authors and Ben Manes & CO couldn't think of slicing the map into shards with mutex and bolting a heap on there? That's just stupid
362 天前
回复了 cyhone 创建的主题 Go 编程语言 剖析 Golang Bigcache 的极致性能优化
1.) Yes, multi-level caching is common in highload projects, but usually such caching is based on a cache library with a eviction policy that keeps track of the most frequent items and reduces response time for them, and sharded redis or memcached that stores all other items already. Let's try to explain with the example of a backend microservice. Let's say we have a very highload service that uses bigcache and possibly redis and postgresql. The service was deployed a long time ago and has already accumulated the maximum size of bigcache, some of the elements are stored in redis, and for the rest we need to go to the database and other services. Here comes a developer who has just completed his task and wants to redeploy the service and there is a problem: at simple redeployment of the service bigcache will be cleared and redis and postgresql will be flooded with additional requests because of which the system may degrade (in this example not very much, because there is redis but we want to replace redis with bigcache :). The only good solution to this problem that I know of is to use a monotonous canary deploy gradually filling up the bigcache in the service pods but this is not a very nice thing to do. And kubernetes can sometimes restart pods.... In general, I see two main problems in trying to replace redis with bigcache: 1. difficulties with redeployment 2. data consistency, which is much more important. And in multilevel caching more often use libraries with eviction policy simply because such a cache with 10 million elements is able to produce hit ratio more than 80% even on such complex traces as search and database. And gc will survive such a load quite easily.

2.) And this is a bit more fun. As far as I know, go is usually used either for backend services, or cli (a cache with a huge number of elements is simply not needed), or some boxed solutions (like dgraph or jaeger for example). For backend services the choice seems to be obvious in the direction of redis and throwing away bigcache, at least consistency problems are already solved for you and io queries are already greatly reduced due to pipelining. You can also refer to this issue https://github.com/allegro/bigcache/issues/27. Ristretto is even a faster map :) I haven't investigated the maximum number of elements in otter and other caches with eviction policy but I can say for sure that gc in golang can digest 10 million elements in such caches without much delay. (I suspect that at 100 million gc will already be bad but it should be checked) dgraph for example quietly uses ristretto, and from other languages you can easily look at kafka and cassandra, which use caffeine( https://github.com/ben-manes/caffeine), which creates additional pressure on gc.

3.) I haven't encountered such a thing, but it can be (usually it's strings after all).

Conclusion: yes, you can use bigcache, but most likely you are doing something wrong (like allegro in the bigcache article) or you should already know very well what you are doing.

The universal advice is: just use RTO (theine is better at the moment) and if you do run into problems (I doubt it), try bigcache
363 天前
回复了 cyhone 创建的主题 Go 编程语言 剖析 Golang Bigcache 的极致性能优化
Hi, I will write in English but I hope many people will find my opinion and review of cache libraries in Go useful.

So, cache libraries in Go are of two types
1. Cache libraries that use a good eviction policy but put extra pressure on gc
2. Cache libraries that don't use eviction policies (just delete the first inserted element) but don't put extra pressure on gc

Let's talk about the second category first. The main representatives are: fastcache( https://github.com/VictoriaMetrics/fastcache), bigcache( https://github.com/allegro/bigcache) and freecache( https://github.com/coocood/freecache). Ok, when should we use them? It seems that approximately never because the only advantage they give is the absence of pressure on gc, even when storing tens of millions of key-value pairs libraries of the first type will win. Even the article about the creation of bigcache makes me smile https://blog.allegro.tech/2016/03/writing-fast-cache-service-in-go.html . They just wrote a worse version of redis or memcached and got no benefits. The only case when I think that using such libraries is justified is when you write a release-cycle storage where data will be stored in ram, like VictoriaMetrics( https://github.com/VictoriaMetrics), for which fastcache was written (it is faster than bigcache by the way). But for the rest there is no point in such libraries, as a good eviction policy will give much more cache hits than trying to store gigabytes of additional data. Also, these libraries do not really reduce the load on gc because they work only with strings/slices of bytes, which forces to convert other types into them, which firstly takes a lot of time and secondly greatly increase the pressure on gc, which has to do all this, and also leads to a strong fragmentation of memory (external because of string allocation and internal, because after deleting items in memory allocated by these caches huge holes are formed).

Now let's talk about libraries of the first type.

There are two types of libraries: slow but simple libraries with global locking (the vast majority of them) like https://github.com/hashicorp/golang-lru and faster libraries that try to avoid global locking ristretto( https://github.com/dgraph-io/ristretto), theine( https://github.com/Yiling-J/theine-go) and I'm writing a much faster alternative to them otter( https://github.com/maypok86/otter). RTO (ristretto theine otter) is already suitable for many more users, as it gives a good eviction policy and is more user friendly. Ok, which one should I choose then? Let's take a look at them in order
1. Ristretto. DON'T USE IT AT ALL. It has a terrible hit ratio and the authors don't answer questions about it (and basically don't answer anything). You can see it here https://github.com/dgraph-io/ristretto/issues/346 and here https://github.com/dgraph-io/ristretto/issues/336. It also allows eviction policy updates to be lost after being added to the map, which is actually a memory leak. It also replaces keys with their hash, which can cause you to run into zombies. And other problems
2. Theine. A good library that I don't really have any complaints about, except that its speed degrades already 100000 items to cache level with LRU and global locking but in return it provides a great hit ratio
3. Otter. Do not use it yet, as it is not production-ready yet. Although the intermediate results are very impressive: otter is more than 5 times faster than ristretto and theine, and on most traces from ristretto's benchmarks outperforms all of them by hit ratio

Somehow, I hope it was useful because I meet a lot of misunderstandings on this topic
Hey guys, the author of Otter is here. I only understand Chinese with google translator so using this site was quite difficult for me but I think I did well. :)

So I'll try to answer the discussion (at least what I understood) in English, if you don't mind. I will call the Ristretto/Theine/Otter set of libraries as RTO.

1. I think the benchmarks in memorycache are absolutely not representative because they use a terrible trace for all caches with a good eviction policy, since a trace of completely random strings makes the cache roll into single thread execution, constantly preempting keys due to lack of matching. In such conditions all the techniques used in RTO for acceleration not only do not help, but also harm + the cost of eviction policy also harms, simply because such a policy wastes extra CPU time and does not give any benefits. As an example: with high probability in Java, a map with mutex and heap to store the expiration time will overtake Caffeine on a trace of random strings but you can't go to Ben Manes and say that you killed his Caffeine. :) RTO libraries use zipfian (Theine) or more representative scrambled zipfian (Ristretto and Otter) for this reason and on such traces memorycache already loses seriously.

BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-10 25730020 43.56 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-10 11104183 95.58 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-10 24799726 46.79 ns/op 16 B/op 1 allocs/op
BenchmarkCache/Zipf_Memory_reads=100%,writes=0%-10 8501631 143.7 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-10 18628352 62.30 ns/op 6 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-10 2528754 456.8 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-10 2830969 418.2 ns/op 45 B/op 1 allocs/op
BenchmarkCache/Zipf_Memory_reads=75%,writes=25%-10 3291450 376.4 ns/op 9 B/op 0 allocs/op

I will not give results for other load types, but the results are not better there. As a result, we have that the results of these benchmarks simply cannot be trusted.

2. About code complexity. Yes, Otter's code is more complicated to understand, but it demonstrates better perfomance and hit ratio. Because of this, the advantage of simple code is not very clear if a user doesn't need to understand the library and its intricacies. The user just wants to start using the library and have it show good results, and Otter does that. (I could give an example about DB or caffeine, for example, but I'm lazy).

3. About the hash table used in Otter and the pressure on the gc. Spoiler: I found this argument rather strange, but let's look into it. First of all about the hash table in Otter: yes, Otter uses a hash table based on map from xsync library, but it uses a different architecture based on seq locks and a number of hacks that puzpuzpuz couldn't use when implementing the hash table. Now what was puzpuzpuz talking about when mentioning the higher gc pressure? It's about the fact that sync.Map uses two standard maps inside itself, where key and value are stored without creating additional structure for key-value pair (aka sync.Map does not create structure of &Entry{key, value} kind, which is leaked to the heap). xsync.Map does create this structure, which causes additional allocations and increases the load on gc which has to keep track of pointers). Okay, what about Otter? When inserting a key-value pair, Otter creates a structure like &Entry{key, value} and gives this structure to the hash table, which already knows how to work with these structures (compare keys, etc.). Wait, but then Otter additionally pressure gc, as described above. Unfortunately, yes, but other cache libraries do even worse things. Let's break down why.

Here are some links to the main data structures in the rest of the cache libraries:

https://github.com/Yiling-J/theine-go/blob/main/internal/store.go#L39 (here things are even sadder from gc's point of view, because theine not only stores pointers to a key-value pair, but also duplicates the value of keys in two places (if the key is int it's still ok, but if the key is a string it will end up storing two pointers to strings (we'll forget about the overhead with len and cap) and *Entry[K, V]. Which puts more pressure on gc than Otter, obviously.

https://github.com/dgraph-io/ristretto/blob/main/store.go#L118 (ristretto doesn't seem to store pointers, but there's a catch), because 1. they replace the key with a hash, which can lead to terrible errors 2. stores the time completely (along with the location pointer) 3. allocation when adding a key still happens https://github.com/dgraph-io/ristretto/blob/main/cache.go#L285

https://github.com/lxzan/memorycache/blob/main/cache.go#L289 and https://github.com/lxzan/memorycache/blob/main/types.go#L18 are sad here. 1. there is not even a hint of generics 2. the key is again stored twice and chokes gc 3. the value is stored as type any (which is actually interface{}). If you dive a bit deeper into the implementation of interfaces in go, you'll learn that interfaces store a pointer to the value contained in them. As a result, we have that even the value separately leaks into the heap. This puts terrible pressure on gc.

4. About the speed of the otter.
In fact, this is due to several factors: Otter uses hash table, which is faster than the standard one, queue, which is faster than the standard one, wait-free buffers, which are faster than the implementation from Theine, counters, which outperform atomic.Int64 in cases of frequent writes, approximate time for TTL, honest batching for access to the eviction policy (Otter does not waste time on frequent lock/unlock operations with full read and write buffers) and a few more ideas.

Unfortunately, I haven't had time lately to finish the work for Otter, but in the next few weeks, everything should be calmer and I hope to finally finish it.

And a huge thank you for even noticing Otter, because I got the impression that even the eye-catching title doesn't interest anyone and everyone just walks past it.

P.S. how long you have to wait on this site to be able to write something...
关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   2973 人在线   最高记录 6679   ·     Select Language
创意工作者们的社区
World is powered by solitude
VERSION: 3.9.8.5 · 11ms · UTC 13:20 · PVG 21:20 · LAX 05:20 · JFK 08:20
Developed with CodeLauncher
♥ Do have faith in what you're doing.