"All of this is made possible with the inclusion of frame pointers in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient)"
This makes things so, so, so much easier. Otherwise, a lot of effort has to built into creating an unwinder in ebpf code, essentially porting .eh_frame cfa/ra/bp calculations.
They claim to have event profilers for non-native languages (e.g. python). Does this mean that they use something similar to https://github.com/benfred/py-spy ? Otherwise, it's not obvious to me how they can read python state.
Thanks! Those blogs are incredibly useful. Nice work on the profiler. :)
I have multiple questions if you don’t mind answering them:
Is there significant overhead to native unwinding and python in ebpf? EBPF needs to constantly read & copy from user space to read data structures.
I ask this because unwinding with frame pointers can be done by reading without copying in userland.
Python can be ran with different engines (cpython, pypy, etc) and versions (3.7, 3.8,…) and compilers can reorganize offsets. Reading from offsets in seems me to be handwavy. Does this work well in practice/when did it fail?
Overhead ultimately depends on the frequency, it defaults to 19hz per core, at which it’s less than 1%, which is tried and tested with all sorts of super heavy python, JVM, rust, etc. workloads. Since it’s per core it tends to be plenty of stacks to build statistical significance quickly. The profiler is essentially a thread-per-core model, which certainly helps for perf.
The offset approach has evolved a bit, it’s mixed with some disassembling today, with that combination it’s rock solid. It is dependent on the engine, and in the case of python only support cpython today.
I would assume the name is a reference to the use of strobes in examining high speed periodic motion, like that in motors or on production lines, eg:
https://www.checkline.com/inspection_stroboscope
That's really cool. I only wish open source projects were this integrated. (Imagine if making a PR would estimate your AWS cost increase after running canary Kubernetes.)
Also what's really cool to see is that Facebook's internal UI actually looks decent. Never work in a company of anywhere close to that size and the tooling always look like it was puked by a dog.
This is really cool! I've always thought that one thing preventing major competitors to AWS/Azure/GCP is the lack of easy-to-use tooling for machine level monitoring like this. When I was at Microsoft, we built a tool like this that used Windows Firewall filters to track all the network traffic between our services and it was incredibly useful for debugging.
That said, as with anything from Meta, I approach this with a grain of salt and the fact that I can't tell what they stand to gain from this makes me suspicious.
> the fact that I can't tell what they stand to gain from this makes me suspicious.
Meta is one of the biggest contributors to FOSS in the world. (React, PyTorch, Llama, …). They stand to gain what every big company does, a community contributing to their infra.
You’ll note that nobody is open sourcing their ad recommender, that is the one you should be skeptical about if you ever see. You don’t share your secret sauce.
> Maybe, but the gold chain, million dollar watch wearing CEO talking about masculine energy doesn't help the brand.
Why not exactly? Between Meta’s great contributions to the open-source ecosystem and Mark behaving more like a normal man nowadays, right now is the only time in a long time that I’ve considered applying to go work at Meta. I’ve heard several of my colleagues and friends say the same thing in recent months.
Perhaps a contributing factor is how HN shows only the final non-eTLD [0] label of the domain. If it showed all labels, you'd have seen "engineering.fb.com" which, while not a dead giveaway, implies that the problem space is technical.
It would be nice if this aggressive truncation were applied only above a certain threshold of length.
We are actually saying different things, and your point highlights an error in mine (i.e., I assumed they show the eTLD from the PSL plus one extra label, but apparently they have their own shadow PSL which omits things like pp.se and therefore occasionally shows nothing but an eTLD?) but either way we agree that showing more would be better.
We’re working hard to bring a lot of Strobelight to everyone through Parca[0] as OSS and Polar Signals[1] as the commercial version. Some parts already exists much to come this year! :)
At Yandex we have a similar profiler that supports native languages seamlessly, with addition to Python/Java: https://github.com/yandex/perforator. It's exciting to see new profilers from big players!
Between LLVM's optimization passes, static analysis, and modern LLM-powered tools, couldn't we build systems that not only identify but automatically fix these performance issues? GitHub Copilot already suggests code - why not have "Copilot Performance" that refactors inefficient patterns?
I'm curious if anyone is working on "self-healing" systems where the optimization feedback loop is closed automatically rather than requiring human engineers to parse complex profiling data.
> A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the ‘auto’ keyword in C++.
> The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.
> It was a simple mistake that any engineer working in C++ has made a hundred times.
> So, the engineer typed an “&” after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!
It's a cool anecdote. It's also a case study in heavyweight copies being something that shouldn't happen by default, and should require explicit annotation indicating that the engineer expects a heavyweight copy of the entire structure.
If it's safety/correctness versus performance, I think the default should be the former. Copying, while inefficient is generally more correct and avoids hard-to-debug errors. It's the whole discussion about premature optimization.
I'd rather make a copy than make sure the array is not mutated anywhere ever.
Yes, everyone agrees with you. The claim you responded to was that you should have to be explicit, because it is very easy to unintentionally copy. For example, it is easy to copy when there is never more than one live pointer to a datastructure. It's easy to copy when you allocate a resource in a function and return it, which makes the original an orphan which is then immediately freed. It's extremely easy to make a mistake which prevents move from working and you have to go back and carefully check if you want to be sure. It should be trivial to just say "move this" and if something isn't right it's an error at compile time, rather than just falling back to silently being wasteful.
I'm not saying it should silently alias any more than it should silently copy. It should give an error, and require the developer to explicitly copy or explicitly alias.
Imagine how much server capacity we could save if we didn't waste the equivalent electrical consumption of Belgium convincing your mother she needs more garbage from Temu.
I recommend https://grafana.com/oss/pyroscope/ for continuous profiling; I use it with Go and it works well.
They support many languages https://grafana.com/docs/pyroscope/latest/configure-client/l... (also based on eBPF).
Good to know there's an OSS alternative.
Strobelight is open source as well.
The Otel profiling agent (formerly prodfiler, then Elastic profiler) is the underlying OSS.
And open sourcing: https://github.com/facebookincubator/strobelight
C++ from Meta/FB is much more pleasant to read than that from ... other, older big techs. I appreciate that.
I only see around three .cpp files in the entire project?
Look at other FB projects.
Strobelight is a lifesaver. Especially with high-QPS services, it makes it much easier to see where it's worth spending time optimizing.
"All of this is made possible with the inclusion of frame pointers in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient)"
This makes things so, so, so much easier. Otherwise, a lot of effort has to go into building an unwinder in eBPF code, essentially porting the .eh_frame CFA/RA/BP calculations.
They claim to have event profilers for non-native languages (e.g. Python). Does this mean they use something similar to https://github.com/benfred/py-spy ? Otherwise, it's not obvious to me how they can read Python state.
Lastly, the GitHub repo https://github.com/facebookincubator/strobelight is pretty barebones. I wonder when they'll update it.
Already been done:
1) native unwinding: https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
2) python: https://www.polarsignals.com/blog/posts/2023/10/04/profiling...
Both available as part of the Parca open source project.
https://www.parca.dev/
(Disclaimer I work on Parca and am the founder of Polar Signals)
Thanks! Those blogs are incredibly useful. Nice work on the profiler. :)
I have multiple questions if you don’t mind answering them:
Is there significant overhead to native unwinding and Python profiling in eBPF? eBPF needs to constantly read and copy from user space to inspect data structures.
I ask because unwinding with frame pointers can be done in userland by reading without copying.
Python can be run with different engines (CPython, PyPy, etc.) and versions (3.7, 3.8, …), and compilers can reorganize struct offsets. Reading from hard-coded offsets seems hand-wavy to me. Does this work well in practice, and when has it failed?
Thank you!
Overhead ultimately depends on the sampling frequency; it defaults to 19 Hz per core, at which point it’s less than 1%. That’s tried and tested with all sorts of super-heavy Python, JVM, Rust, etc. workloads. Since sampling is per core, it tends to collect plenty of stacks to build statistical significance quickly. The profiler is essentially a thread-per-core model, which certainly helps for perf.
The offset approach has evolved a bit; it’s mixed with some disassembly today, and with that combination it’s rock solid. It is dependent on the engine, and in the case of Python we only support CPython today.
Short note: Also available as the standard Otel profiling agent ;)
I would assume the name is a reference to the use of strobes in examining high speed periodic motion, like that in motors or on production lines, eg: https://www.checkline.com/inspection_stroboscope
That's really cool. I only wish open source projects were this integrated. (Imagine if making a PR would estimate your AWS cost increase after running canary Kubernetes.)
Also, what's really cool to see is that Facebook's internal UI actually looks decent. I've never worked at a company anywhere close to that size, and the internal tooling always looks like it was puked up by a dog.
DOPE
Fractal compute expense modelling is hard.
One may do well in applying fluid dynamics (the kind we cannot maintain in our heads) to compute requirements; it will be funny once we realize that everything is micro (pico) fluid dynamics in general.
This is really cool! I've always thought that one thing preventing major competitors to AWS/Azure/GCP is the lack of easy-to-use tooling for machine level monitoring like this. When I was at Microsoft, we built a tool like this that used Windows Firewall filters to track all the network traffic between our services and it was incredibly useful for debugging.
That said, as with anything from Meta, I approach this with a grain of salt and the fact that I can't tell what they stand to gain from this makes me suspicious.
> the fact that I can't tell what they stand to gain from this makes me suspicious.
Meta is one of the biggest contributors to FOSS in the world. (React, PyTorch, Llama, …). They stand to gain what every big company does, a community contributing to their infra.
You’ll note that nobody is open sourcing their ad recommender; that is the one you should be skeptical about if you ever see one. You don’t share your secret sauce.
> You’ll note that nobody is open sourcing their ad recommender
Actually... (2019) https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-l...
Source code:
https://github.com/facebookresearch/dlrm
Paper:
https://arxiv.org/abs/1906.00091
Updated 2023 blog post, though solely about content recommendation; ads recommendation is ~90% the same:
https://engineering.fb.com/2023/08/09/ml-applications/scalin...
It's a little out of date, but the internal one is built with the same concepts, just more advanced modeling techniques and data.
ByteDance shared the TikTok content recommender, which I'd argue is somewhat close to an ad recommender :)
You mean the paper, not the source code?
Plus it helps them recruit engineers who are already familiar with their tech stack.
As a sibling commenter said, it helps branding and recruiting, which Meta cares about.
Maybe, but the gold chain, million dollar watch wearing CEO talking about masculine energy doesn't help the brand.
> Maybe, but the gold chain, million dollar watch wearing CEO talking about masculine energy doesn't help the brand.
Why not exactly? Between Meta’s great contributions to the open-source ecosystem and Mark behaving more like a normal man nowadays, right now is the only time in a long time that I’ve considered applying to go work at Meta. I’ve heard several of my colleagues and friends say the same thing in recent months.
Imagining that there's anything "normal" about that knucklehead is why "masculinity" is such an easy target for parody.
What's unattractive about "how do you do, fellow humans"?
> Imagining that there's anything "normal" about that knucklehead is why "masculinity" is such an easy target for parody.
You’re certainly entitled to your opinions and ad hominems. Many folks, including myself, disagree with you, so there’s that.
Yep, and you yours of course.
But man is that dude a bad example of how to be a human.
I'll cut him some slack for growing up in public with stupid money and no one to regulate his impulses, but uff da.
Wake me up when he's old enough for his lagging prefrontal cortex to catch up with the rest of him.
Ah, this is performance profiling.
Seeing the title and the domain I thought this was user profiling and I was wondering why would Meta be publishing this.
> the domain
Perhaps a contributing factor is how HN shows only the final non-eTLD [0] label of the domain. If it showed all labels, you'd have seen "engineering.fb.com" which, while not a dead giveaway, implies that the problem space is technical.
It would be nice if this aggressive truncation were applied only above a certain threshold of length.
[0] https://en.wikipedia.org/wiki/Public_Suffix_List
I suggested this 10 years ago. <https://news.ycombinator.com/item?id=8911044>
We are actually saying different things, and your point highlights an error in mine (i.e., I assumed they show the eTLD from the PSL plus one extra label, but apparently they have their own shadow PSL which omits things like pp.se and therefore occasionally shows nothing but an eTLD?) but either way we agree that showing more would be better.
We’re working hard to bring a lot of Strobelight to everyone through Parca[0] as OSS and Polar Signals[1] as the commercial version. Some parts already exist, with much more to come this year! :)
[0] https://www.parca.dev/
[1] https://www.polarsignals.com/
(Disclaimer: founder of polar signals)
At Yandex we have a similar profiler that supports native languages seamlessly, in addition to Python/Java: https://github.com/yandex/perforator. It's exciting to see new profilers from big players!
Between LLVM's optimization passes, static analysis, and modern LLM-powered tools, couldn't we build systems that not only identify but automatically fix these performance issues? GitHub Copilot already suggests code - why not have "Copilot Performance" that refactors inefficient patterns?
I'm curious if anyone is working on "self-healing" systems where the optimization feedback loop is closed automatically rather than requiring human engineers to parse complex profiling data.
I just wish Meta would open source Scuba.
Cool anecdote from inside the article:
> A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the ‘auto’ keyword in C++.
> The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.
> It was a simple mistake that any engineer working in C++ has made a hundred times.
> So, the engineer typed an “&” after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!
That one diff blew my mind when I saw it. It’s a prime example of that story about “you paid me a lot of money to know where to fix that pipe”
It's a cool anecdote. It's also a case study in heavyweight copies being something that shouldn't happen by default, and should require explicit annotation indicating that the engineer expects a heavyweight copy of the entire structure.
I don’t know if that would have helped here, if memory serves me right:
1. The copy was needed initially.
2. The structure wasn’t as heavy back then.
…over time the code evolved so it became heavy and the copy became unnecessary. That’s harder to find without profiling to guide things.
If it's safety/correctness versus performance, I think the default should be the former. Copying, while inefficient, is generally more correct and avoids hard-to-debug errors. It's the whole discussion about premature optimization. I'd rather make a copy than make sure the array is not mutated anywhere ever.
Yes, everyone agrees with you. The claim you responded to was that you should have to be explicit, because it is very easy to unintentionally copy. For example, it is easy to copy when there is never more than one live pointer to a data structure. It's easy to copy when you allocate a resource in a function and return it, which makes the original an orphan which is then immediately freed. It's extremely easy to make a mistake which prevents move from working, and you have to go back and carefully check if you want to be sure. It should be trivial to just say "move this" and have anything wrong be a compile-time error, rather than falling back to silently being wasteful.
This exact problem is basically why Rust exists.
I'm not saying it should silently alias any more than it should silently copy. It should give an error, and require the developer to explicitly copy or explicitly alias.
[flagged]
Tired vote-bait quote.
Only because the Overton window has shifted enough to normalize it.
Imagine how much server capacity we could save if we didn't waste the equivalent electrical consumption of Belgium convincing your mother she needs more garbage from Temu.
And then how would we pay for that server capacity?