The Problem With Plus Signs and S3 Objects

Ever tried downloading files with a plus sign in the filename via S3 object URLs? There’s a catch with those.

S3 Object URLs

You can access public S3 objects over HTTP or HTTPS. The URL of an object is typically https://<bucket-name>.s3.amazonaws.com/<path>, where <path> is the object's key, including any prefix directories.

This is very useful and even makes it possible to host an RPM repository. This type of hosting doesn’t provide directory indexing, but if you maintain a static index that doesn’t matter.

The problem with this hosting method appears when the path of a file in the bucket contains plus signs. If a file in the bucket is named “hello+world.txt”, you would expect the following URL to access that file.

https://micho-test-plus-signs.s3.amazonaws.com/hello+world.txt

Surprisingly, instead you’ll receive an access denied error message page.

[Image: S3 object URL with unencoded plus]

The cause is an interesting problem with how spaces are URL-encoded.

Spaces, Plus Signs, and URLs

URLs cannot contain spaces. URL encoding replaces them with either %20 or, under the form-encoding convention, a plus sign, and that is where the problem comes from. A server that decodes plus signs as spaces cannot tell whether a plus in the URL was meant as a literal plus or as a space. If a URL needs a literal plus symbol, the plus must be explicitly URL-encoded (replaced with %2B).
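The ambiguity is easy to see with Python's standard urllib.parse module, which implements both conventions. This is only an illustration of the two decoding rules, not a claim about how S3 is implemented internally.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Two ways to encode a space.
print(quote("hello world.txt"))       # hello%20world.txt
print(quote_plus("hello world.txt"))  # hello+world.txt

# Two ways to decode a plus sign.
print(unquote("hello+world.txt"))       # hello+world.txt (plus stays a plus)
print(unquote_plus("hello+world.txt"))  # hello world.txt (plus becomes a space)

# An explicitly encoded plus is unambiguous under both conventions.
print(unquote("hello%2Bworld.txt"))       # hello+world.txt
print(unquote_plus("hello%2Bworld.txt"))  # hello+world.txt
```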

https://micho-test-plus-signs.s3.amazonaws.com/hello%2Bworld.txt

So a workaround for the error shown above is to replace the plus sign in the URL with %2B (pre-emptive URL encoding).
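If you generate these links from a script, the encoding can be done for you. Here is a minimal sketch in Python; the helper name s3_object_url is made up for this example, and it assumes the https://<bucket-name>.s3.amazonaws.com/<path> URL form described earlier.

```python
from urllib.parse import quote

def s3_object_url(bucket: str, key: str) -> str:
    """Build an S3 object URL with the key percent-encoded (hypothetical helper).

    quote() leaves "/" alone by default, so prefixes keep their
    separators, while "+" becomes %2B and " " becomes %20.
    """
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"

print(s3_object_url("micho-test-plus-signs", "hello+world.txt"))
# https://micho-test-plus-signs.s3.amazonaws.com/hello%2Bworld.txt
```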

[Image: S3 object URL with encoded plus]

S3 Objects With Spaces

You might wonder why Amazon doesn't just handle this by default for S3 object URLs. The reason is files with a space in their filename or prefix.

If a file in a bucket has a space in its filename (a space is a valid POSIX filename character), it can be accessed with a URL in which plus symbols replace the spaces.

[Image: S3 object URL with plus for space]

It can also be accessed with the spaces URL-encoded (replaced with %20).

[Image: S3 object URL with space for space]

So essentially, S3 is doing the right thing with plus symbols and using them to access files in a bucket with spaces in their filename or prefix.

Even if people disagree with this behaviour and think it should work the other way around (i.e. a plus symbol resolves to a plus symbol and spaces must be URL-encoded as %20), changing it now would break backwards compatibility with S3. If a bucket holds two files with almost the same filename (one with a space and the other with a plus symbol), then changing this behaviour would change which file a given S3 object URL returns.

The Issue With RPM Repos

This all seems fine as long as the encoding is explicitly taken care of, but one annoying consequence shows up when trying to host an RPM repo using S3 object URLs. The repo metadata points dnf/yum at the files using unencoded URLs, so dnf breaks when it tries to download an RPM like libstdc++.
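To see why a package like libstdc++ fails, here is a small sketch of what happens to an unencoded path when plus signs are decoded as spaces. The package path is a made-up example, and unquote_plus only stands in for the decoding behaviour described above.

```python
from urllib.parse import unquote_plus

# Hypothetical path as it might appear, unencoded, in repo metadata.
href = "Packages/libstdc++-1.0-1.x86_64.rpm"

# Decoding plus signs as spaces yields a key that is not the one stored.
requested_key = unquote_plus(href)
print(repr(requested_key))    # 'Packages/libstdc  -1.0-1.x86_64.rpm'
print(requested_key == href)  # False
```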

https://github.com/rpm-software-management/createrepo_c/issues/215
https://github.com/rpm-software-management/librepo/pull/188
https://bugzilla.redhat.com/show_bug.cgi?id=1817130

There’s been discussion in bug tickets, and even some PRs to librepo to handle this, but the issue still persists in CentOS today. I don’t think URL-encoding the repo metadata makes sense (it could break the repo depending on how it’s hosted).

If you want to host an RPM repo in S3, though, I found there’s another way to do it besides S3 object URLs.

Using S3 Websites Instead of S3 Object URLs

There is an alternative to using S3 object URLs to access S3 objects over HTTP. Website hosting can be enabled on a bucket, which makes its objects accessible and lets HTML files be viewed directly in the browser.

The interesting thing about this alternative, when it comes to the plus symbol problem, is that S3 websites handle plus signs differently from S3 object URLs. S3 websites always interpret a plus sign in the URL as a plus sign and not a space (i.e. as %2B, not %20). This means you can access all objects from the browser without intervention, since your browser normally encodes spaces as %20 for you.

[Image: S3 website URL with plus for plus]

I find accessing S3 objects via the S3 website significantly more human friendly.

Summary

So in summary, there are two workarounds for accessing S3 objects with plus symbols in the filename or prefix:

  1. URL-encode the plus signs as %2B when accessing them
  2. use S3 websites instead of S3 object URLs

Comparing S3 Object URLs and S3 Websites

A quick summary of the combinations of spaces, plus symbols and their url encoded counterparts.

In a bucket with two files:

  • hello+world.txt
  • hello world.txt

We would see the following files returned depending on the URL:

URL input                     | S3 Object URL   | S3 Website URL
------------------------------+-----------------+----------------
http://[..]/hello+world.txt   | hello world.txt | hello+world.txt
http://[..]/hello%2Bworld.txt | hello+world.txt | hello+world.txt
http://[..]/hello world.txt   | hello world.txt | hello world.txt
http://[..]/hello%20world.txt | hello world.txt | hello world.txt
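The table above can be reproduced with a small model in Python. This is only a sketch: the bucket is a hypothetical in-memory set, and it assumes the S3 object URL column behaves like unquote_plus and the S3 website column like unquote, which matches the behaviour described in this post but says nothing about how S3 implements it.

```python
from typing import Optional
from urllib.parse import unquote, unquote_plus

# Hypothetical in-memory stand-in for the bucket's keys.
bucket = {"hello+world.txt", "hello world.txt"}

def via_object_url(path: str) -> Optional[str]:
    """Model of the S3 object URL column: a plus decodes to a space."""
    key = unquote_plus(path)
    return key if key in bucket else None

def via_website_url(path: str) -> Optional[str]:
    """Model of the S3 website column: a plus stays a plus."""
    key = unquote(path)
    return key if key in bucket else None

for path in ["hello+world.txt", "hello%2Bworld.txt",
             "hello world.txt", "hello%20world.txt"]:
    print(f"{path!r:24} {via_object_url(path)!r:20} {via_website_url(path)!r}")
```

Only the first row differs between the two columns, which is exactly the case that trips up unencoded links.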