Update Oct 14.
In keeping with my practice of only passing primitives to DelayedJob to mitigate serialization errors, I made transfer_and_cleanup a class method that accepts a Document ID, rather than an instance method on Document. The new code has been inserted below.
I’ve been in the dev game for about 15 years, and for whatever reason I always seem to end up needing the ability to handle user-generated content, i.e. uploads. Back in the day, I used attachment_fu, then transitioned to Paperclip and have stuck with it ever since. Other than a few sketchy point releases, Paperclip has progressed remarkably well and is a solid, reliable solution for handling uploads.
The Paperclip S3 workflow generally looks like this:
- Server saves the file to a tmp directory and processes it if necessary
- File is transferred to S3 and deleted out of tmp
Makes sense, right?
Except for the timeouts that occur when your users have crap connections, or if dealing with very large files. Then there are the platform considerations: Heroku has an explicit 30-second HTTP timeout, and while Engine Yard is as configurable as you want it to be, I have personally seen unresolvable issues with this workflow when uploading files in the hundreds of megabytes.
Not to mention you’re making your users wait for two uploads to complete. Again, with larger files, this ends up being pretty gross.
So what’s the answer?
Upload files directly to S3, skipping the middleman and all the associated hassles. Easy right? There are many articles out there that detail how to do just this, but they generally stop short of detailing the full upload lifecycle.
For my most recent project, I was already using Paperclip for user avatars, which are small enough to not have to worry about custom workflows, and I wanted to continue using Paperclip for file uploads in order to leverage its attachment handling syntax and callbacks. I’ve seen many folks asking about this, so I thought I’d document my process for the greater good.
Let’s get started!
Our app will work as follows:
- User uploads their file directly to a temporary directory on S3
- A form callback posts the temporary file URL to our app
- Our app creates a new Document object, sets some initial data from the temporary S3 file, then queues a background process to move the temporary file to the location that Paperclip expects it to be and to process thumbnails if required
- Show users a message if they visit a file page while it's still being processed
2. S3 Setup
First things first: sign up for Amazon S3. Once you have an account, you’ll need a bucket for each of your environments. The easiest way to do this is from the S3 Management Console. For this post, we’ll be living strictly in the dev bucket. Once you have your bucket, you will need to set up CORS to allow cross-origin uploads. From the Properties tab of your bucket, click on “Edit CORS configuration” and paste in the following XML:
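The configuration below is a permissive sketch suitable for a dev bucket — it allows uploads from any origin, which is what you want while developing locally:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
  </CORSRule>
</CORSConfiguration>
```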
Note: You’ll want to set the AllowedOrigin value to your proper site on your production bucket.
3. App Configuration
You’ll need three gems in your Gemfile to get rolling:
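Something along these lines — version pins are omitted, though note that Paperclip’s S3 storage of this era targets the aws-sdk v1 API:

```ruby
# Gemfile
gem 'paperclip'
gem 'aws-sdk', '~> 1.0'   # v1 API, which Paperclip's S3 storage adapter expects
gem 's3_direct_upload'
```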
Once these are bundled, you’ll need to create a yaml file for your AWS keys so that you can interact with your bucket directly, as well as from Paperclip and s3_direct_upload. Your exact setup will depend on your hosting platform and preferences. If I’m using Heroku, I generally pop dev keys right into the yaml file and load production keys from an ENV variable that has been set with heroku config:set.
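A hypothetical layout for that file might look like this (filename and key names are my own choices, not a convention the gems enforce):

```yaml
# config/aws.yml
development:
  access_key_id: YOUR_DEV_ACCESS_KEY
  secret_access_key: YOUR_DEV_SECRET_KEY
  bucket: myapp-dev
production:
  access_key_id: <%= ENV['AWS_ACCESS_KEY_ID'] %>
  secret_access_key: <%= ENV['AWS_SECRET_ACCESS_KEY'] %>
  bucket: myapp-production
```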
With the yaml in place, you’ll need to create some configuration files under config:
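As a sketch, an initializer can load the yaml into a constant and hand the keys to s3_direct_upload (the `AWS_CONFIG` constant and filename are assumptions; `S3DirectUpload.config` is the gem’s configuration block):

```ruby
# config/initializers/s3_direct_upload.rb (filename assumed)
AWS_CONFIG = YAML.load(
  ERB.new(File.read(Rails.root.join('config', 'aws.yml'))).result
)[Rails.env]

S3DirectUpload.config do |c|
  c.access_key_id     = AWS_CONFIG['access_key_id']
  c.secret_access_key = AWS_CONFIG['secret_access_key']
  c.bucket            = AWS_CONFIG['bucket']
end
```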
4. The front end
There’s a wealth of information on setting up, configuring, and styling s3_direct_upload on GitHub. Essentially we need an upload form and an XHR controller action that will receive callbacks from that form.
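A minimal form using s3_direct_upload’s helper might look like the following — the `callback_param` name is an assumption that the back end below will have to match:

```erb
<%# app/views/documents/new.html.erb %>
<%= s3_uploader_form callback_url: documents_url,
                     callback_param: "document[direct_upload_url]",
                     id: "s3-uploader" do %>
  <%= file_field_tag :file, multiple: true %>
<% end %>
```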
So what’s all this doing?
The form helper handles the posting of the upload to S3, with support for multiple file uploads via jquery-fileupload. Once a file has been uploaded, a callback is made to POST /documents with some information about the upload, most importantly the S3 key (upload URL). The controller then attempts to create a document based on that key, which brings us to the back end.
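That controller action can stay tiny, since all the interesting work happens in the model — a sketch, assuming the `document[direct_upload_url]` callback param from the form:

```ruby
# app/controllers/documents_controller.rb
class DocumentsController < ApplicationController
  # XHR callback target for s3_direct_upload
  def create
    @document = Document.new(direct_upload_url: params[:document][:direct_upload_url])
    if @document.save
      render json: @document, status: :created
    else
      render json: { errors: @document.errors }, status: :unprocessable_entity
    end
  end
end
```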
5. The back end
The document model (you can name this whatever you want) will handle persistence and management of user uploads. In addition to the standard file attachment columns provided by Paperclip, it will need a processed column, since we’ll be doing some background processing post-upload. Here’s what your migration might look like:
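Something like this — the attachment is named `upload` here (an arbitrary choice), so the columns follow Paperclip’s `{attachment}_*` naming convention, and I’ve persisted `direct_upload_url` so the background job can find the temp file:

```ruby
# db/migrate/xxxx_create_documents.rb
class CreateDocuments < ActiveRecord::Migration
  def change
    create_table :documents do |t|
      t.string   :direct_upload_url, null: false  # temp S3 location from the form callback
      t.string   :upload_file_name               # Paperclip's standard attachment columns
      t.string   :upload_content_type
      t.integer  :upload_file_size
      t.datetime :upload_updated_at
      t.boolean  :processed, default: false      # flipped once the background job finishes
      t.timestamps
    end
  end
end
```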
In order to support Paperclip fully, a model must include the following upload attributes:
- file name
- content type
- file size
- S3 location
The first three attributes are sent along with the temporary file location whenever a file is posted to /documents, as can be seen from the following post params dump:
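A dump from the s3_direct_upload callback looks roughly like this (every value here is invented for illustration):

```ruby
# Hypothetical POST /documents params
{
  "document" => {
    "name"              => "whitepaper.pdf",
    "content_type"      => "application/pdf",
    "size"              => "483895",
    "direct_upload_url" => "https://myapp-dev.s3.amazonaws.com/uploads/1234-abcd/whitepaper.pdf"
  }
}
```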
Great, right? At this point, you have two options:
- Trust the data
- Ignore the data, and re-query it direct from S3
I always opt for option 2, to prevent any possibility of malicious/incorrect data being saved to the database. This is especially relevant if you’re imposing any specific limitations on your file types, upload bandwidth, etc.
So where does that leave us? When a file is posted, we need to query S3 for the file’s metadata. A single head call gives us the attributes we need. Once those are in place, we queue up a final call that either 1) manually moves the temporary file to the location that Paperclip expects, if there is no post-processing required, or 2) runs a post-process call by setting the upload URI, which will copy down the file from the temporary location, process, and re-upload all of the required styles plus the original. By keeping uploaded files in the locations that Paperclip expects, we can leverage its built in methods for handling downloads, deletes, etc.
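Putting that together — and, per the October 14 update, making transfer_and_cleanup a class method that takes a Document ID — the model might look like this. It’s a sketch assuming the aws-sdk v1 API, an attachment named `:upload`, a persisted `direct_upload_url` column, and a hypothetical `s3_object_for` helper of my own naming:

```ruby
# app/models/document.rb
class Document < ActiveRecord::Base
  has_attached_file :upload

  before_create :set_upload_attributes
  after_create  :queue_processing

  # Class method (not an instance method) so DelayedJob only serializes
  # a primitive ID, never an ActiveRecord object.
  def self.transfer_and_cleanup(document_id)
    document   = Document.find(document_id)
    tmp_object = document.s3_object_for(document.direct_upload_url)

    if document.post_process_required?
      # Assigning a URI to the attachment makes Paperclip copy down the file,
      # process it, and re-upload every style plus the original.
      document.upload = URI.parse(URI.escape(document.direct_upload_url))
    else
      # No processing needed: server-side copy straight to Paperclip's path.
      tmp_object.copy_to(document.upload.path.sub(%r{\A/}, ''), acl: :private)
    end

    document.processed = true
    document.save!
    tmp_object.delete
  end

  def post_process_required?
    upload_content_type =~ %r{\Aimage/}  # e.g. only images get thumbnails
  end

  def s3_object_for(url)
    # Hypothetical helper: map a bucket URL back to its S3 object (aws-sdk v1)
    key = URI.parse(URI.unescape(url)).path.sub(%r{\A/}, '')
    AWS::S3.new.buckets[AWS_CONFIG['bucket']].objects[key]
  end

  private

  # One HEAD request gives us trustworthy metadata, rather than
  # trusting whatever the client posted.
  def set_upload_attributes
    head = s3_object_for(direct_upload_url).head
    self.upload_file_name    = File.basename(URI.parse(direct_upload_url).path)
    self.upload_content_type = head.content_type
    self.upload_file_size    = head.content_length
  end

  def queue_processing
    Document.delay.transfer_and_cleanup(id)  # pass a primitive, per the update above
  end
end
```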
Note: I use delayed_job but you can use the background processor of your choice.
You’ll want to show users a friendly message if a file hasn’t been processed yet, and to restrict download access until processing completes. Because our files live in the happy place where Paperclip wants them, everything else is cake. Downloads can be performed with a redirect to the attachment’s expiring S3 URL.
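A sketch of such a download action, using Paperclip’s expiring_url to generate a short-lived presigned URL (the ten-second expiry and redirect target are my own choices):

```ruby
# app/controllers/documents_controller.rb
def download
  document = Document.find(params[:id])
  if document.processed?
    redirect_to document.upload.expiring_url(10)
  else
    flash[:notice] = "This file is still being processed. Check back shortly."
    redirect_to documents_path
  end
end
```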
document.destroy will destroy both the Document object and the files on S3.
This being my first tutorial post, it’s entirely possible I’ve missed something. Please feel free to leave feedback, questions, or “what the heck?!”s in the comments section below. Cheers.