How We Spotted and Fixed a Performance Degradation in Our Python Code
(This article is adapted from a write-up by the Redash team and is a nice example of how to use py-spy.)
Recently we’ve started transitioning from using Celery to using RQ as our task running engine. For phase one, we only migrated the jobs that aren’t directly running queries. These jobs include things like sending emails, figuring out which queries need to be refreshed, recording user events and other maintenance jobs.
After deploying this we noticed our RQ workers required significantly more CPU to perform the same amount of tasks as our Celery workers did. I thought I’d share how I profiled this and remedied the problem.
A word about the differences between Celery and RQ
Both Celery and RQ have the concept of a Worker process and both use forking to allow parallelized execution of jobs. When you launch a Celery worker it forks to several different processes, each one autonomously handles tasks. With RQ, a worker will only instantiate a single sub-process (known as a “Work Horse”) which will perform a single job and then die. When the worker fetches another job from the queue, it will fork a new Work Horse.
In RQ you can achieve the same parallelism as Celery simply by running more worker processes. However there is a subtle difference between Celery and RQ: Celery workers instantiate multiple subprocesses at launch time and reuse them for multiple tasks. With RQ, you have to fork on every job. There are pros and cons to both approaches, but these are out of scope for this post.
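To make the fork-per-job model concrete, here is a minimal sketch of the pattern described above. It is not RQ's actual implementation, and fetch_job/perform_job are hypothetical stand-ins for the real queue operations:

# Simplified sketch of RQ's fork-per-job pattern (not the real implementation).
# fetch_job and perform_job are hypothetical placeholders.
import os

def run_worker(fetch_job, perform_job):
    while True:
        job = fetch_job()            # block until a job is available
        if job is None:
            break                    # burst mode: stop once the queue is empty
        pid = os.fork()              # a fresh work horse for every single job
        if pid == 0:                 # child (work horse): run the job, then exit
            perform_job(job)
            os._exit(0)
        os.waitpid(pid, 0)           # parent: wait for the work horse to die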
Benchmarking
Before I profiled anything, I wanted a benchmark of how long it takes for a worker container to process 1,000 jobs. I decided to focus on the record_event job, since it is a frequent, lightweight operation. I used the time command to measure performance, which required a couple of changes to the source code:
- To measure these 1,000 jobs I preferred RQ’s burst mode, which exits the process after handling the jobs.
- I wanted to avoid measuring other jobs that might be scheduled at the time of benchmarking, so I moved record_event to a dedicated queue called benchmark by replacing @job('default') with @job('benchmark') right above record_event’s declaration in tasks/general.py (see the sketch after this list).
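For clarity, this is roughly what that decorator swap looks like. The import path of the job decorator and record_event's signature are assumptions for illustration, not Redash's exact code:

# tasks/general.py (sketch; import path and signature are assumptions)
from redash.worker import job      # assumption: Redash's wrapper around RQ's @job

@job('benchmark')                  # was: @job('default')
def record_event(raw_event):
    ...                            # body unchanged; it just persists the event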
Now we can start timing things. First of all, I wanted to see how long it takes for a worker to start and stop (without any jobs) so I can subtract that time from any result later.
$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"
real 0m14.728s
user 0m6.810s
sys 0m2.750s
Worker initialization takes 14.7 seconds on my machine. I’ll remember that.
Then, I shoved 1,000 dummy record_event jobs onto the benchmark queue:
$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
Now let’s run the same worker command as before and see how long it takes to process 1,000 jobs:
$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"
real 1m57.332s
user 1m11.320s
sys 0m27.540s
Deducting our 14.7 second boot time, we see that it took 102 seconds for 4 workers to handle 1,000 jobs. Now let’s try to figure out why! For that, we’ll use py-spy while our workers are working hard.
Profiling
Let’s add another 1,000 jobs (because our last measurement consumed all of them), run the workers and simultaneously spy on them:
$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
$ docker-compose exec worker bash -c 'nohup ./manage.py rq workers 4 benchmark & sleep 15 && pip install py-spy && rq info -u "redis://redis:6379/0" | grep busy | awk "{print $3}" | grep -o -P "\s\d+" | head -n 1 | xargs py-spy record -d 10 --subprocesses -o profile.svg -p'
$ open -a "Google Chrome" profile.svg
I know, that last command is quite a handful. Ideally I would break it at every ‘&&’ for readability, but the commands need to run sequentially inside the same docker-compose exec worker bash session, so here’s a quick breakdown of what it does:
- Start 4 burst workers in the background
- Wait 15 seconds (roughly to get them to finish booting)
- Install py-spy
- Run rq info and slice up the PID of one of the workers
- Record 10 seconds of activity in that PID and save it to profile.svg
The result was this flame graph:
Digging inside the flame graph, I noticed that record_event spends a big portion of its execution time in sqlalchemy.orm.configure_mappers, and that it happens on every job execution. From their docs I saw this:
Initialize the inter-mapper relationships of all mappers that have been constructed thus far.
This sort of thing really doesn’t need to happen on every fork. We could initialize these relationships in the parent worker once and avoid the repeated effort in the work horses.
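Here is a minimal sketch of that idea, not Redash's exact fix: warm up the mappers once in the parent worker before it starts forking work horses. The Redis host, queue name and model import are assumptions based on the commands above:

# Sketch: pay the mapper-configuration cost once in the parent worker so every
# forked work horse inherits already-configured mappers instead of redoing the
# work on each job. The models import path is an assumption.
from redis import Redis
from rq import Queue, Worker
from sqlalchemy.orm import configure_mappers

# from myapp import models        # importing the app's models registers all mappers

if __name__ == "__main__":
    configure_mappers()            # one-time cost, inherited by every fork
    connection = Redis(host="redis", port=6379, db=0)
    worker = Worker([Queue("benchmark", connection=connection)], connection=connection)
    worker.work(burst=True)        # burst mode: exit once the queue is empty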
So I added a call to sqlalchemy.orm.configure_mappers() before starting the work horse and measured again:
$ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
$ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"
real 0m39.348s
user 0m15.190s
sys 0m10.330s
If we deduct our 14.7 second boot time, we are down from 102 seconds to 24.6 seconds for 4 workers to handle 1,000 jobs. That’s a 4x improvement! With this fix we managed to cut our RQ production resources by a factor of 4 while keeping the same throughput.
My key takeaway here is that you should keep in mind that your app behaves differently when it’s a single process and when it forks. If there’s some heavy lifting to perform on every job, it’s usually a good idea to do it once before the fork. These kinds of things don’t show up during testing and development, so make sure you measure well and dig into any performance issues that come up.