Large-scale websites rely on Content Delivery Networks (CDNs) to deliver content reliably and efficiently around the world. Beyond caching assets closer to users, CDNs compress files, accelerate downloads, and improve user experience. Under certain conditions, however, they can introduce new problems. One such incident involved mishandled GZIP compression and character sets, which led to corrupted downloads and mojibake (garbled text) in filenames, a combination that challenged developers and operators alike.

TL;DR: A misconfigured CDN stripped the Content-Encoding: gzip header from downloadable files and dropped the charset hints that accompanied their filenames. Users received corrupted downloads and unreadable filenames (mojibake). The issue was resolved by configuring the CDN to preserve Content-Encoding and by declaring the charset explicitly in the HTTP headers, including the filename* parameter of Content-Disposition, so that browsers interpreted both the content and the filename correctly. This case highlights the importance of consistent content encoding, especially when using CDNs that may modify HTTP headers.

What Went Wrong: Compression Mismanagement

At the heart of the issue was the CDN’s inappropriate handling of the Content-Encoding header. The origin server correctly compressed files using GZIP and labeled them with the following header:

Content-Encoding: gzip

However, the CDN, which was meant to optimize delivery, stripped this header and served the still-compressed body as if it were uncompressed. Page assets such as CSS and JavaScript still appeared to work, but when users downloaded files such as CSVs, PDFs, or ZIP archives, the browser saved the compressed bytes as the file itself. Opening or extracting these files either failed outright or produced data that looked unreadable or incomplete.
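The effect of a missing header can be reproduced locally. The following is a minimal sketch, not the incident's actual tooling: body_as_saved is a hypothetical helper that mimics a client which decompresses only when Content-Encoding says gzip, and the CSV payload is made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def body_as_saved(body: bytes, headers: dict[str, str]) -> bytes:
    """Mimic what a client writes to disk: decode only if the header says gzip."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    return body  # no header: the compressed bytes are saved untouched


# Made-up payload standing in for a downloadable CSV.
original = "id,name\n1,r\u00e9sum\u00e9\n".encode("utf-8")
compressed = gzip.compress(original)

with_header = body_as_saved(compressed, {"Content-Encoding": "gzip"})
without_header = body_as_saved(compressed, {})

print(with_header == original)                # True
print(without_header.startswith(GZIP_MAGIC))  # True: gzip bytes, not CSV text
```

A saved file that starts with the bytes 1f 8b when it should be a PDF or CSV is a strong hint that the header was lost somewhere between origin and browser.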

Beyond binary corruption, an even more mysterious problem emerged: some filenames appeared distorted with strange symbols, particularly when downloaded using browsers like Chrome or Firefox. This phenomenon is known as mojibake, and it occurs when a program interprets a sequence of bytes using an unintended character encoding.

Confusion in Character Encodings

Mojibake in downloaded filenames typically occurs when:

  • The filename contains non-ASCII characters (such as accented letters or Asian scripts)
  • The browser doesn’t know which character set to use
  • The Content-Disposition or Content-Type headers lack an explicit charset declaration

The browser, guessing wrong, tries to interpret the filename using a default or fallback encoding like ISO-8859-1, leading to gibberish in place of legible characters. This usually affects users downloading files with filenames in languages like Japanese, Russian, or German, where special characters are prevalent.
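The wrong guess is easy to reproduce. This short sketch encodes the example filename as UTF-8 and then decodes the same bytes as ISO-8859-1, which is the mismatch described above; the variable names are illustrative only.

```python
# "résumé.pdf", written with escapes so the source file's own encoding is irrelevant.
filename = "r\u00e9sum\u00e9.pdf"

raw = filename.encode("utf-8")      # bytes the server actually sends
misread = raw.decode("iso-8859-1")  # bytes interpreted with the wrong charset

print(misread)  # rÃ©sumÃ©.pdf

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == filename)  # True
```

Each "é" is two bytes in UTF-8 (C3 A9), and ISO-8859-1 maps every byte to its own character, which is why one letter turns into the pair "Ã©".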

Originally, the developers had set appropriate headers from the application server, such as:

Content-Type: application/octet-stream; charset=utf-8
Content-Disposition: attachment; filename="résumé.pdf"

But, once again, the CDN altered these headers by removing or replacing them, so downloads arrived without the charset hint. The quoted filename parameter carries no encoding label of its own, so the browser had to guess, and it interpreted the filename with the wrong encoding.

The Fix: Enforcing Charset in HTTP Headers

After much debugging and log tracing, developers confirmed that:

  • The files weren’t corrupted at the origin server.
  • Downloads were successful via curl and direct IP access.
  • The issue only occurred when served through the CDN.
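A comparison like the one the developers performed can be scripted. The sketch below is an assumption about how such a check might look, not the team's actual tooling: fetch_headers and diff_headers are hypothetical helpers, and the header values are illustrative, shaped like the incident described here. Only diff_headers runs in the example, so no network access is needed.

```python
from urllib.request import Request, urlopen

WATCHED = ("Content-Encoding", "Content-Type", "Content-Disposition")


def fetch_headers(url: str) -> dict[str, str]:
    """Send a HEAD request and return the response headers (needs network access)."""
    request = Request(url, method="HEAD", headers={"Accept-Encoding": "gzip"})
    with urlopen(request) as response:
        return dict(response.headers.items())


def diff_headers(origin: dict[str, str], cdn: dict[str, str], watched=WATCHED):
    """Return {header: (origin_value, cdn_value)} for watched headers that differ."""
    origin_l = {name.lower(): value for name, value in origin.items()}
    cdn_l = {name.lower(): value for name, value in cdn.items()}
    changes = {}
    for name in watched:
        before = origin_l.get(name.lower())
        after = cdn_l.get(name.lower())
        if before != after:
            changes[name] = (before, after)
    return changes


# Illustrative values only; in practice both dicts would come from fetch_headers().
origin = {
    "Content-Type": "application/octet-stream; charset=utf-8",
    "Content-Disposition": 'attachment; filename="r\u00e9sum\u00e9.pdf"',
    "Content-Encoding": "gzip",
}
cdn = {"content-type": "application/octet-stream"}

for name, (before, after) in diff_headers(origin, cdn).items():
    print(f"{name}: origin={before!r} cdn={after!r}")
```

Header names are compared case-insensitively because intermediaries are free to change their capitalization.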

Therefore, the proper solution was twofold:

  1. Force the CDN to preserve Content-Encoding headers so that browsers receive and decompress GZIP content properly.
  2. Set an explicit charset on Content-Type and, within Content-Disposition, through the filename* parameter, to guarantee proper decoding of international filenames.

The final working header configuration looked like this:

Content-Type: application/octet-stream; charset=utf-8
Content-Disposition: attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf
Content-Encoding: gzip

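The filename* parameter, defined for Content-Disposition in RFC 6266, uses the extended syntax from RFC 5987: the charset name, two single quotes (with an optional language tag between them), and then the percent-encoded value. This tells the browser the filename's encoding explicitly instead of leaving it to guess. RFC 5987 has since been superseded by RFC 8187, which keeps the same syntax. Modern browsers support this form widely, which keeps behavior consistent across platforms.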
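Building this header by hand is error-prone, so it helps to generate it. The function below is a minimal sketch: content_disposition is a hypothetical helper name, and the plain ASCII filename fallback it emits alongside filename* is an addition that does not appear in the configuration above (RFC 6266 suggests sending both for clients that do not understand filename*).

```python
import unicodedata
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header with an ASCII fallback and a UTF-8 filename*."""
    # Fallback: strip accents and drop anything that is still not ASCII.
    fallback = (
        unicodedata.normalize("NFKD", filename)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove characters that would break the quoted string.
    fallback = fallback.replace('"', "").replace("\\", "") or "download"
    # Percent-encode the UTF-8 bytes; safe="" leaves only unreserved characters.
    encoded = quote(filename, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


print(content_disposition("r\u00e9sum\u00e9.pdf"))
# attachment; filename="resume.pdf"; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf
```

For names with no ASCII-compatible letters at all, such as a Japanese filename, the fallback shrinks to little more than the extension, so the filename* value is what modern browsers will actually use.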

Why CDNs Alter Headers

CDNs often aim to optimize performance, reduce redundancy, and standardize responses. To this end, they may:

  • Strip or replace compression directives
  • Normalize content types
  • Remove headers that don’t pass security filters or caching rules

However, these optimizations can backfire when they override carefully set parameters critical for content rendering or file downloading. In this incident, the CDN’s failure to preserve the correct Content-Encoding and charset proved detrimental to both usability and internationalization.

Lessons Learned

This issue serves as a valuable reminder for developers working in distributed environments:

  • Always test content delivery end-to-end. Files that work on your server may behave differently behind a CDN.
  • Be explicit in headers. Assume nothing about default behaviors — always declare content type, encoding, and charset.
  • Control CDN behavior through configuration. Most CDNs allow overrides or rules that preserve headers. Use them.
  • Verify download behavior in multiple browsers and locales. Internationalization bugs often appear only under these conditions.
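Some of these checks can be automated in an end-to-end test. The sketch below assumes the test has access to the raw, undecoded response body and its headers; check_download is a hypothetical helper, and its two rules are only the ones suggested by this incident. Note that a genuine .gz download would trip the first rule, so it is a heuristic rather than a definitive test.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def check_download(headers: dict[str, str], body: bytes) -> list[str]:
    """Return human-readable problems found in one download response."""
    h = {name.lower(): value for name, value in headers.items()}
    problems = []

    # Rule 1: compressed bytes must be announced as compressed.
    encoding = h.get("content-encoding", "").lower()
    if body.startswith(GZIP_MAGIC) and "gzip" not in encoding:
        problems.append("body looks gzip-compressed but Content-Encoding is not gzip")

    # Rule 2: non-ASCII filenames need the explicitly encoded filename* form.
    disposition = h.get("content-disposition", "")
    lowered = disposition.lower()
    if "filename" in lowered and not disposition.isascii() and "filename*=" not in lowered:
        problems.append("non-ASCII filename sent without filename*")

    return problems


body = gzip.compress(b"example payload")  # made-up content
good = {
    "Content-Type": "application/octet-stream; charset=utf-8",
    "Content-Disposition": "attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf",
    "Content-Encoding": "gzip",
}
bad = {"Content-Disposition": 'attachment; filename="r\u00e9sum\u00e9.pdf"'}

print(check_download(good, body))       # []
print(len(check_download(bad, body)))   # 2
```

Running the same check against both the origin and the CDN hostname in each supported locale makes header regressions visible before users report them.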

FAQ

What is mojibake?

Mojibake is a term used to describe the garbled or incorrect display of characters caused by character encoding mismatches. It often occurs when software misinterprets the character encoding used to store or send text data.

How does gzip affect file downloads?

When used correctly, GZIP compresses files to reduce download time. However, if a file is served as GZIP-compressed while lacking the appropriate Content-Encoding: gzip header, browsers may not decompress it, leading to corrupted or unreadable downloads.

Why would a CDN strip headers like Content-Encoding or charset?

CDNs prioritize performance and security. In doing so, they often normalize headers or apply policies that remove potentially unsafe or unnecessary information. This can inadvertently remove critical metadata needed for correct content handling.

What is the correct way to specify non-ASCII filenames for downloads?

Use the Content-Disposition header with the filename* parameter, giving the charset as UTF-8 and the value in percent-encoded form, as specified in RFC 5987 and applied to Content-Disposition by RFC 6266. For example:

Content-Disposition: attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf

How can developers avoid such issues in the future?

They should run tests through the CDN layer, specify headers explicitly, and use CDN configurations that preserve or pass through all required metadata. Maintaining documentation of how the CDN alters traffic also makes debugging much faster.