Python's GIL is EVIL

Lately I've been doing some Python multi-threading to make the best use of some of our amazing server resources. As I was pondering the reasons why one of our 8-core servers reported 83% idle despite 8 threads banging away, I re-discovered the Global Interpreter Lock.

BLECH!

The GIL enforces Python's requirement that only a single bytecode operation is executed at a time. My nicely coded multi-threaded app was only being executed serially!! Sadly, this seems unlikely to change, even in Python 3000. Last year Guido said:

"Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions."

I was brought up to believe that threading was dirty and independent communicating processes were the way to go. But even I realize that this just isn't practical in these days of GUIs, multi-core processors, and application servers.
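A rough way to see the serial execution described above, assuming a standard GIL-enabled CPython build, is to time the same pure-Python loop run back to back and then on several threads. The names here (`spin`, `run_serial`, `run_threaded`, `timed`) and the loop size are made up for illustration, and timings vary by machine, so no numbers are promised.

```python
import threading
import time


def spin(n):
    """Pure-Python busy loop; it holds the GIL while it runs bytecode."""
    total = 0
    for i in range(n):
        total += i
    return total


def run_serial(jobs, n):
    """Run the loop `jobs` times, one after another, on the calling thread."""
    return [spin(n) for _ in range(jobs)]


def run_threaded(jobs, n):
    """Run the loop once on each of `jobs` threads and collect the results."""
    results = [None] * jobs

    def work(slot):
        results[slot] = spin(n)

    threads = [threading.Thread(target=work, args=(slot,)) for slot in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (elapsed seconds, result) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    jobs, n = 4, 2_000_000
    serial_s, _ = timed(run_serial, jobs, n)
    threaded_s, _ = timed(run_threaded, jobs, n)
    print(f"serial:   {serial_s:.2f}s")
    print(f"threaded: {threaded_s:.2f}s")
```

If the GIL behaves as described, the threaded time should land close to the serial time rather than at a quarter of it.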

Why does the Python community accept the GIL? Is it because most people only use Python as a scripting language? Are there simple workarounds (e.g. not forking, shared memory, or the like) that I'm missing?

Comments

"Fixed" in python 3.2,

"Fixed" in python 3.2, released today. Fixed in the sense that multi-threaded programs will not be significantly slower than single-threaded programs because of the (new) GIL; but you still cannot expect core utilization beyond 1.

How hard would it be to

How hard would it be to have the Python SWIG bindings release Python's GIL on all functions? Could this be done in one place in a .i file, or would each method need it?
I'm writing a multithreaded server that's currently in Python and don't want to block when it descends into svn's C code. The networking part is already in C++ and handles connection setup, marshalling, etc., but just gets the GIL when it calls into Python code. This is all with Ice. I'll handle locking with

RE:

I was brought up to believe that threading was dirty and independent communicating processes were the way to go.

GIL and performance issues

Python has a GIL as opposed to fine-grained locking for several reasons:

--- It is faster in the single-threaded case.

--- It is faster in the multi-threaded case for i/o bound programs.

--- It is faster in the multi-threaded case for cpu bound programs that do their compute-intensive work in C libraries.

--- It makes C extensions easier to write: there will be no switch of Python threads except where you allow it to happen (i.e. between the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros).

--- It makes wrapping C libraries easier. You don't have to worry about thread-safety. If the library is not thread-safe, you simply keep the GIL locked while you call it.

The GIL can be released by C extensions. Python's standard library releases the GIL around each blocking i/o call. Thus the GIL has no consequence for performance of i/o bound servers. You can thus create networking servers in Python using processes (fork), threads or asynchronous i/o, and the GIL will not get in your way.
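A small sketch of the i/o-bound case, with `time.sleep` standing in for a blocking socket read (an assumption made so the example needs no network). The names `handle_request` and `serve_all` are hypothetical. Each thread spends its time waiting with the GIL released, so the waits overlap.

```python
import threading
import time


def handle_request(delay, done):
    # time.sleep stands in for a blocking socket read; like blocking i/o in
    # the standard library, it releases the GIL while it waits.
    time.sleep(delay)
    done.append(delay)


def serve_all(delays):
    """Handle every fake request on its own thread; return (elapsed, done)."""
    done = []
    threads = [threading.Thread(target=handle_request, args=(d, done)) for d in delays]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, done


if __name__ == "__main__":
    elapsed, done = serve_all([0.2] * 8)
    print(f"{len(done)} requests, {sum(done):.1f}s of waiting, {elapsed:.2f}s elapsed")
```

The elapsed time should be close to one request's wait, not the sum of all of them.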

Numerical libraries in C or Fortran can similarly be called with the GIL released. While your C extension is waiting for an FFT to complete, the interpreter will be executing other Python threads. A GIL is thus easier and faster than fine-grained locking in this case as well. This constitutes the bulk of numerical work. The NumPy extension releases the GIL whenever possible.
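NumPy may not be installed everywhere, so the sketch below uses `hashlib` from the standard library as a stand-in for a C numerical routine. As far as I know, its documentation says the GIL is released while hashing buffers larger than 2047 bytes, so these threads can overlap inside the C code. The function names and buffer sizes are illustrative only.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest_block(block):
    """Hash one large buffer; the heavy lifting happens in C."""
    return hashlib.sha256(block).hexdigest()


def digest_all(blocks, workers=4):
    """Hash every block on a small thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest_block, blocks))


if __name__ == "__main__":
    # Four 16 MiB buffers, each filled with a different byte value.
    blocks = [bytes([i]) * (16 * 1024 * 1024) for i in range(4)]
    for hexdigest in digest_all(blocks):
        print(hexdigest[:16])
```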

Threads are usually a bad way to write most server programs. If the load is low, forking is easier. If the load is high, asynchronous i/o and event-driven programming (e.g. using Python's Twisted framework) is better. The only excuse for using threads is the lack of os.fork on Windows.

The GIL is a problem if, and only if, you are doing CPU-intensive work in pure Python. Here you can get a cleaner design using processes and message-passing (e.g. mpi4py). There is also a 'processing' module in the Python Cheese Shop that gives processes the same interface as threads (i.e. replace threading.Thread with processing.Process).
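A minimal sketch of that thread-to-process swap, written against the standard library's `multiprocessing` module (the form in which the 'processing' package was later adopted, per PEP 371). The prime-counting workload and the names `count_primes`, `_worker` and `count_primes_in_processes` are made up for illustration.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound pure-Python work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def _worker(limit, results):
    # Runs in a child process, which has its own interpreter and its own GIL.
    results.put((limit, count_primes(limit)))


def count_primes_in_processes(limits):
    """Start one process per limit and return {limit: prime count}."""
    results = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=_worker, args=(limit, results))
        for limit in limits
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    collected = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    # The guard matters on platforms that start children by re-importing
    # this module (e.g. Windows).
    print(count_primes_in_processes([20000, 30000, 40000]))
```

The structure mirrors what the threaded version would look like: `Process` takes the same `target` and `args` as `Thread`, and the queue carries results back in place of shared memory.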

Threads can be used to maintain responsiveness of a GUI regardless of the GIL. If the GIL impairs your performance (cf. the discussion above), you can let your thread spawn a process and wait for it to finish.
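One way to sketch the "thread spawns a process and waits" idea without a real GUI toolkit: a worker thread runs the heavy computation in a child interpreter via `subprocess`, and a plain loop plays the part of the GUI event loop. The names `compute_in_child` and `run_with_responsive_loop` are hypothetical, and the tick counter is only a placeholder for real event handling.

```python
import subprocess
import sys
import threading


def compute_in_child(code, results):
    """Worker thread body: run `code` in a separate interpreter and wait."""
    finished = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    results.append(finished.stdout.strip())


def run_with_responsive_loop(code):
    """Start the job, then keep 'handling events' until it finishes."""
    results = []
    worker = threading.Thread(target=compute_in_child, args=(code, results))
    worker.start()
    ticks = 0
    while worker.is_alive():
        ticks += 1                 # placeholder for processing one GUI event
        worker.join(timeout=0.01)  # yield briefly instead of busy-waiting
    return results[0], ticks


if __name__ == "__main__":
    output, ticks = run_with_responsive_loop(
        "print(sum(i * i for i in range(10**7)))"
    )
    print(f"child returned {output}; main loop ticked {ticks} times meanwhile")
```

The worker thread is blocked waiting on the child, with the GIL released, so the main loop keeps running while the child does the CPU-bound work on another core.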

Rebuttal of performance judgements

> Threads are usually a bad way to write most server programs. If the load is low, forking is easier. If the load is high, asynchronous i/o and event-driven programming (e.g. using Python's Twisted framework) is better.
Apart from the complexity: while what you say cannot be falsified for Python, since it lacks real thread support, it has been falsified (i.e. shown to be wrong) for Java.
See slides 55+ of:
http://www.bytestopshere.com/assets/content/sdwest2008.ppt

Usage of fork requires explicit communication, which is less simple to use. And anyway, allowing usage of both models

About Guido's statement, deadlock predates threads by ages, because shared resources do. Message sending (which is slower than the alternatives in multithreaded programs) still allows for deadlocks even when no locks are used.

About external libraries, if a given library is not thread safe, using a lock for that library is just as simple.
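A sketch of what "a lock for that library" could look like. `UnsafeTally` is a hypothetical stand-in for a library that is not thread-safe, and `LockedTally` and `run_workers` are likewise invented for the example. Every call into the library goes through one `threading.Lock`.

```python
import threading


class UnsafeTally:
    """Hypothetical stand-in for a library that is not thread-safe."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        current = self.total           # read ...
        self.total = current + amount  # ... then write: a race without a lock


class LockedTally:
    """Wrapper that serializes every call into the unsafe library."""

    def __init__(self):
        self._tally = UnsafeTally()
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._tally.add(amount)

    @property
    def total(self):
        with self._lock:
            return self._tally.total


def run_workers(thread_count, adds_per_thread):
    """Hammer one LockedTally from several threads and return the final total."""
    tally = LockedTally()

    def work():
        for _ in range(adds_per_thread):
            tally.add(1)

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tally.total


if __name__ == "__main__":
    print(run_workers(4, 10000))
```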

The real problem is that CPython is stuck with slow reference counting instead of a fast, modern, real GC, so removing the GIL makes CPython twice as slow. Fine-grained locking is slower, but not by that factor. Atomic manipulation of reference counts is what killed the performance of GIL-free CPython, and reference counting already slows Python down considerably. I can't provide benchmarks for this, though.

The even bigger problem is that while almost nobody needs to write JNI code for Java, writing Python extensions in C is something which is commonly mentioned, not only for interfacing with foreign libraries but also to help performance.

Well, the goal of a VM writer is to allow client programmers to ignore performance as far as possible. In Java, inlining allows using getter/setter methods and smaller methods without a performance penalty, while other techniques allow removing the overhead of virtual calls (something that only a VM can do, while compiled C++ cannot), removing the need for final, and so on.

pyProcessing

Will be included in python 2.6 and 3.0
http://www.python.org/dev/peps/pep-0371/

http://pyprocessing.berlios.de/

I really stumbled upon your blog and the pep by accident while looking for other things.

Use Erlang!

I think you could consider Erlang and you will probably love it. It has many powerful and elegant list operations just like Python, but it is designed to be totally parallel. No threads, but inter-process messaging. With Erlang it's easy to use all your cores/processors fully.

I think I have to test this

I think I have to test this Erlang...

Consider Scala

You might consider the Scala programming language: I think it offers the best of Python but running on the JVM. It's very nearly as fast as Java (and thus much faster than Python), and sometimes faster. It's more OO and more functional than Java, and I think both Python and Java programmers will be very happy with it. And while Scala is a statically typed language, it's much more pleasant than Java, thanks to type inference and a few other nice features.

And it has scripting capabilities and an interpreter shell, like Python (and unlike Java).

http://www.scala-lang.org/

It's definitely not because

It's definitely not because everyone's using python just for quick scripts, or whatever.

I spent a bit of time poking around the GIL issue a few jobs ago, on a Zope system, and came to the conclusion that there are enough alternate approaches for most problems that the advantages of Python generally make it worthwhile. Having said that, we were dependent on the ZODB, Zope's object database, to get multiple things done at once, and it really couldn't manage.

a pity

I think we accept the GIL because we love python so much that by the time we encounter it, we can't give it up. I discovered the dreaded GIL during a class project with genetic algorithms. I had four processors to work with and was hoping for a 4x speed-up, by evaluating four "genes" at a time. It was depressing.

There is Stackless...

There is/was something called Stackless Python which did, I believe, do away with the GIL (it also introduced green threads and continuations, though not necessarily all at the same time).

I'm not sure why they accept the GIL. It could be that they've found that any cases they have where it's really a problem can be dealt with using multiple processes, and ditching it would add significant complexity to the runtime (concurrent garbage collection and having a lock on every object to protect its reference count or a global refcount lock to use when adjusting reference counts are two of the things that come to mind).

But yeah, the present state... I don't know. It does seem that they're somewhat crippling themselves given that increasing parallelism is the direction of increased performance for the foreseeable future.

I don't believe that

I don't believe that Stackless did away with the GIL. If it had, we would all be using Stackless.

As for easy concurrency, look at the processing package. It offers an interface similar to the threading module in Python, only it uses processes. There are queues and other shared data structures that allow each process to easily pass data around (including sockets on Windows and Linux).

Regarding why Python programmers "accept" the GIL, it's not about "acceptance", it's about "it's really hard to get rid of a single coarse-grained lock and replace it with a fine-grained lock". And doing so doesn't necessarily guarantee an increase in performance.

Although not a definitive

Although not a definitive answer by any means, Stackless certainly seems to keep all my cores working hard when it needs to.

Threads seem to be Evil

Of course, I meant that Jython and IronPython do not have a GIL.

Threads seem to be Evil

Hi,

From what I read, a lot of people believe that Threads are Evil. Guido van Rossum says that one should try to avoid threads as much as possible, because they make programs complex and they lead to deadlocks and race conditions. Some programs have been in production for over 4 years without any problem and suddenly they deadlock! It is incredibly difficult, even for simple multi-threaded programs, to think of every possible scenario. For more info, read this article:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

Apparently the Stackless Python implementation got rid of the GIL, but the overhead was 100%! In other words, programs ran twice as fast on the regular CPython interpreter as on Stackless Python (on a single CPU). This means that you only got a performance benefit out of Stackless Python in heavily multithreaded programs running on machines with 3 or more CPUs. Thus this project seems to have been abandoned. Guido van Rossum says that it is not worth the effort, and he definitely will not even try, but he still encourages anyone who has (a lot of) spare time to write a GIL-free Python interpreter.

In the meantime, if you really want to benefit from threads on multiple-CPU machines, you may use Jython (based on Java) or IronPython (based on .NET), which both rely on real threads.

Hope this helps

Stackless Python Vs GIL

The Stackless Python project has nothing to do with the GIL. It's basically a fast(er) version of CPython that supports lightweight coroutines within the same thread.

If you want your Python application to scale across multiple cores, use multiple processes. This works very well for database-driven web apps.

multi-threading is going to be more and more important

The ability to share data across multiple threads of execution will, I think, become more and more important over the next 10 years as CPUs become increasingly multi-core (Intel is talking about 1000-core CPUs).

The ability for a programmer to specify 'read-only data' or 'read-write data' that is to be shared across threads will need to be a language-level feature in most popular languages.

I am currently evaluating Twisted and came across the GIL, and now (December '08) will be looking to see what Python 2.6 and 3.0 have to offer in that area.

the solution :)

Erlang! Erlang! Erlang!