
Why Uploading to S3 Isn’t Enough: The Evolution of Large File Transfer Architecture

When Hugging Face’s XET team analyzed 8.2 million upload requests transferring 130.8 TB in a single day, they discovered that basic S3 uploads couldn’t cut it anymore. This article walks through the architectural evolution from simple blob storage to sophisticated content-addressed systems, showing why companies like Hugging Face, Dropbox, and YouTube all converged on similar patterns: CDNs for global distribution, chunking for reliability, and smart deduplication for efficiency. You’ll learn why the “obvious” solution is never enough when you’re moving terabytes across continents.

Check out the full article on Medium.

Introduction

Here’s the thing nobody tells you about cloud storage: uploading a file to S3 is easy. Uploading 131GB of model weights from Singapore to Virginia at 3 AM when your internet decides to hiccup? That’s a completely different problem.

Hugging Face learned this the hard way. They’re running one of the largest collections of ML models and datasets in the world, with uploads streaming in from 88 countries. Meta’s Llama 3 70B model alone weighs 131GB, split across 30 files because nobody wants to babysit a single file upload for two hours. And here’s the kicker: their infrastructure was starting to crack under the pressure.

The XET team (Hugging Face’s infrastructure wizards) sat down with 24 hours of upload data. 8.2 million requests. 130.8 TB transferred. Traffic from everywhere: California at breakfast, Frankfurt at lunch, Seoul at midnight. And their current setup, S3 with CloudFront CDN, was hitting a wall. CloudFront caps single-file delivery at 30GB. S3 Transfer Acceleration helps, but it doesn’t solve the fundamental problem: you’re still treating files like opaque blobs.

This is the same wall that Dropbox hit when syncing became a bottleneck. The same wall YouTube crashed into when 50GB raw video uploads kept timing out. The pattern repeats because the physics don’t change: large files + unreliable networks + global users = you need a better architecture.

Let me show you how they solved it.

The Naive Approach: Just Use S3 (And Watch It Burn)

When you’re starting out, the S3 solution looks perfect. User uploads a file, you generate a presigned URL, they POST directly to S3, boom, done. Dropbox started this way. YouTube did too. Everyone does.

Here’s what that looks like in practice:
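
A minimal sketch of that flow, assuming boto3 and requests; the bucket name, key, and expiry are placeholders, not anyone’s real setup:

# A rough sketch of the "just use S3" flow: the backend hands out a
# presigned URL, the client PUTs the whole file in one request.
# Bucket and key names are placeholders.
import boto3
import requests

s3 = boto3.client("s3")

def create_upload_url(bucket: str, key: str, expires: int = 3600) -> str:
    # Backend side: generate a presigned PUT URL the client can use directly.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

def upload_file(url: str, path: str) -> None:
    # Client side: one HTTP request for the entire file. Fine for small
    # files; for a 50GB model this single connection is the weak link.
    with open(path, "rb") as f:
        response = requests.put(url, data=f, timeout=300)
    response.raise_for_status()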

This works great until it doesn’t. And it stops working the moment any of these things happen:

The timeout problem. Let’s do the math. You’ve got a 50GB file and a 100 Mbps connection (which is actually pretty good). That’s 50GB × 8 bits/byte ÷ 100 Mbps = 4,000 seconds. Divide by 3,600 and you get 1.11 hours. Your API Gateway times out at 30 seconds. Your web server gives up after 2 minutes. The user’s browser shows a spinning wheel for over an hour with zero feedback. One hiccup in the connection and the entire upload fails.

The size ceiling. CloudFront, which you’re probably using for downloads, caps out at 30GB for single file delivery [1]. API Gateway? 10MB payload limit, non-negotiable [2]. Even if you bypass the gateway and go straight to S3, you’re asking users to upload massive files over a single HTTP connection. That’s fragile as hell.

The geography problem. Hugging Face’s S3 bucket sits in us-east-1 (Virginia). When someone in Singapore uploads a 10GB dataset, that data is traveling 9,000 miles. Every packet, every retry, every byte. There’s no caching on uploads. No edge acceleration that actually helps. It’s just your file crawling across the Pacific.

Dropbox hit this exact issue early on. They had users uploading multi-gigabyte folders, watching progress bars freeze, then having to restart from scratch. YouTube’s story was even worse because video files are huge by nature. A 4K raw video shoot can easily be 100GB+, and filmmakers don’t have patience for “please try again” error messages.

The fundamental problem: you’re treating the network like it’s reliable and the file like it’s atomic. It’s neither.

So what’s the first fix? Bring the data closer to the user.

CDNs: Moving the Goalpost (But Only Halfway)

Content Delivery Networks sound like magic. You put your files in S3, flip on CloudFront, and suddenly users worldwide get fast downloads because the CDN caches files at 400+ edge locations. Someone in Tokyo requests a file? It’s served from Tokyo, not Virginia. Latency drops from 200ms to 20ms. Problem solved, right?

For downloads, absolutely. This is why YouTube doesn’t melt when a viral video gets 10 million views in an hour. The video chunks get cached at edge locations. The origin server (S3 or YouTube’s equivalent) only gets hit once per region. After that, it’s all edge servers doing the work [3].

Hugging Face was already using CloudFront for downloads, and it worked beautifully. Cached model weights, fast retrieval, global coverage. Perfect.

But here’s the catch: CDNs are optimized for reads, not writes.

When you upload a file, there’s no edge caching helping you. Your 50GB model still has to travel from Singapore to us-east-1. The CDN doesn’t intercept it, compress it, or cache it. It just sits there watching the upload lumber across the ocean.

Even worse, CDNs have limits. CloudFront caps files at 30GB. For Llama 3’s 131GB model, you’re already out of luck. You have to chunk it into smaller pieces just to stay under the limit. And chunking introduces a whole new set of problems: tracking which chunks uploaded, handling retries, reassembling them on the other end.

YouTube ran into this hard. They needed users to upload massive video files, but a single upload stream was both fragile and slow. Dropbox faced the same issue with large folder syncs. The realization they all came to? You can’t solve uploads with CDNs alone. You need to rethink the upload path entirely.

Hugging Face’s XET team looked at their traffic patterns and made a key decision: instead of trying to cache uploads at the edge, they would insert a Content-Addressed Storage (CAS) layer between the client and S3. This layer would be geographically distributed, but unlike a CDN, it would be smart about uploads.

But before they could do that, they needed to solve an even more fundamental problem: how do you reliably upload files that are too big to send in one piece?

Chunking: Breaking the File Size Barrier (And Your Sanity)

Here’s the thing about large files: they don’t fit in HTTP requests. Not really. Sure, you can technically POST a 100GB file, but the moment anything goes wrong, you’re starting over. And something always goes wrong.

So you chunk it. You break the file into bite-sized pieces (5–10MB each), upload them separately, and reassemble on the server. This is how Dropbox, YouTube, and now Hugging Face (although they chunk to 20GB) all handle large files. It’s not optional. It’s the only way this works.

The chunking approach looks like this:
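
Here’s a minimal sketch of the client side, assuming a fixed 10MB chunk size and SHA-256 fingerprints; the manifest shape is illustrative, not Hugging Face’s exact format:

# Split a file into fixed-size chunks and build a manifest the server can
# use to track upload state. Chunk size and manifest fields are assumptions.
import hashlib

CHUNK_SIZE = 10 * 1024 * 1024  # 10MB, an illustrative choice

def build_manifest(path: str) -> dict:
    chunks = []
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append({
                "index": index,
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
                "status": "not-uploaded",
            })
            index += 1
    return {"file": path, "chunks": chunks}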

This solves multiple problems at once:

  • Resumability: Your connection drops at 80% uploaded? No problem. Reconnect, ask the server which chunks it has, and upload the rest (there’s a sketch of this flow right after this list). Dropbox nailed this early because syncing a 50GB folder over flaky WiFi is basically impossible without resumable uploads [4].

  • Parallelization: You’ve got 100 Mbps of bandwidth and a 50GB file? Don’t send it sequentially. Break it into 100 chunks and upload 10 at a time. Suddenly you’re maxing out your bandwidth instead of babysitting a single slow connection. YouTube’s resumable upload protocol uses this exact approach, requiring chunk sizes to be multiples of 256 KB [5].

  • Progress tracking: Users can actually see what’s happening. “Uploading chunk 47 of 100” is way better than a frozen progress bar. This is basic UX, but it only works if you chunk.

  • Deduplication: Here’s where it gets interesting. If you fingerprint chunks, you can detect when someone uploads the same data twice. Maybe two users upload the same base model with different fine-tuning. The base model chunks are identical. You store them once, reference them twice. Hugging Face saves terabytes this way.
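
Here’s what the resume-and-retry loop might look like, hedged heavily: the server endpoints, payload shapes, and worker count below are hypothetical, for illustration only.

# Resume an interrupted upload: ask the server which chunks it already has,
# then push only the missing ones, several at a time.
# Endpoints and payload shapes are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

CHUNK_SIZE = 10 * 1024 * 1024  # must match the chunker above

def upload_missing_chunks(api: str, upload_id: str, path: str,
                          manifest: dict, parallel: int = 8) -> None:
    # 1. Resumability: ask the server what it already has.
    have = set(requests.get(f"{api}/uploads/{upload_id}/chunks").json()["uploaded"])
    missing = [c for c in manifest["chunks"] if c["sha256"] not in have]

    def send(chunk: dict) -> None:
        # Re-read just this chunk's byte range from disk.
        with open(path, "rb") as f:
            f.seek(chunk["index"] * CHUNK_SIZE)
            data = f.read(chunk["size"])
        requests.put(
            f"{api}/uploads/{upload_id}/chunks/{chunk['index']}", data=data
        ).raise_for_status()
        # 3. Progress tracking: the user sees chunk-level progress.
        print(f"uploaded chunk {chunk['index'] + 1} of {len(manifest['chunks'])}")

    # 2. Parallelization: several chunks in flight at once.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        list(pool.map(send, missing))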

But chunking isn’t free. You’ve introduced a ton of complexity.

You need a metadata database tracking every chunk’s status. Hugging Face uses something like this:

{
  "fileId": "sha256-abc123...",
  "chunks": [
    { "id": "chunk-1", "status": "uploaded", "etag": "xyz" },
    { "id": "chunk-2", "status": "uploading" },
    { "id": "chunk-3", "status": "not-uploaded" }
  ]
}

You need chunk validation. Clients can lie. S3 doesn’t send notifications for individual multipart chunks, only the completed object. So you use ETags: each chunk gets one, the client sends it to your backend, and you verify it with S3’s ListParts API. Trust, but verify.

You need reassembly logic. Once all chunks are uploaded, you stitch them together (or in S3’s case, complete the multipart upload). If one chunk is corrupt, you retry just that chunk.
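
A sketch of that verify-then-complete step using S3’s multipart API via boto3; bucket, key, and upload ID are placeholders, and pagination is omitted for brevity:

# Verify client-reported ETags against what S3 actually received, then
# complete the multipart upload. Bucket, key, and upload_id are placeholders.
import boto3

s3 = boto3.client("s3")

def verify_and_complete(bucket: str, key: str, upload_id: str,
                        reported: dict[int, str]) -> None:
    # reported maps part number -> ETag the client claims it uploaded.
    # (list_parts returns up to 1,000 parts per call; pagination omitted.)
    listed = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
    actual = {p["PartNumber"]: p["ETag"] for p in listed.get("Parts", [])}

    # Trust, but verify: every claimed part must exist with a matching ETag.
    for part_number, etag in reported.items():
        if actual.get(part_number) != etag:
            raise ValueError(f"part {part_number} missing or corrupt, retry it")

    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={
            "Parts": [{"PartNumber": n, "ETag": e}
                      for n, e in sorted(reported.items())]
        },
    )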

Dropbox went deep on this because sync is their entire business. YouTube needed it for video uploads. Hugging Face needed it for model weights. The pattern is universal: large files require chunking, and chunking requires infrastructure.

But here’s the thing: chunking alone still doesn’t solve the core problem Hugging Face faced. They were moving 130.8 TB per day, and a huge amount of that data was redundant. Different users uploading slight variations of the same model. Same base weights, different adapters. Same datasets, different splits.

They needed more than chunking. They needed content-addressed storage.

Content-Addressed Storage: Stop Uploading the Same Bytes Twice

Let’s talk about waste. Hugging Face noticed something weird in their upload logs: the same data kept showing up. Not identical files, but identical chunks within different files. Two users fine-tune Llama 3, and 90% of the model weights are the same. Why upload them twice?

This is where content-addressed storage (CAS) comes in. Instead of storing files by name (`model-v2.safetensors`), you store them by content hash. If two files have the same bytes, they have the same hash, and you store them once. This approach, pioneered by Git and other distributed systems, uses Merkle trees to enable efficient deduplication while maintaining data integrity [8].

Here’s the magic trick: When a user uploads a file, you chunk it and fingerprint each chunk. Before uploading anything, you ask the CAS: “Do you already have chunk sha256-xyz?” If yes, you skip it. If no, you upload it. Then you store a manifest that maps the file to its chunks.

File “llama3-finetuned.safetensors” = [chunk-A, chunk-B, chunk-C]
File “llama3-base.safetensors” = [chunk-A, chunk-B, chunk-D]

Both files share chunk-A and chunk-B. You store those chunks once, reference them twice. The storage savings are massive.

Dropbox pioneered this for file syncing using a combination of weak rolling checksums (Adler-32) and strong hashing (SHA-256) with Rabin fingerprinting [6]. If you have the same file on your desktop and laptop, Dropbox doesn’t upload it twice. It fingerprints, checks the hash, and skips the duplicate. The technique builds on the classic rsync algorithm [7], which only transfers binary diffs of files. YouTube uses a variation for video uploads: if you upload the same clip twice (maybe re-rendering with different settings), they deduplicate the identical frames.

Hugging Face took this further. They built a CAS layer that sits between clients and S3:
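
A minimal sketch of that layer’s upload path, assuming a simple HTTP API in front of the CAS; the endpoints below are hypothetical, not the real Xet protocol:

# Content-addressed upload: fingerprint every chunk, ask the CAS which
# hashes it already stores, upload only the missing bytes, then register a
# manifest mapping the file to its chunk hashes. Endpoints are hypothetical,
# and chunks are buffered in memory here only for brevity.
import hashlib
import requests

CHUNK_SIZE = 10 * 1024 * 1024

def upload_via_cas(cas: str, path: str) -> list[str]:
    hashes, blocks = [], {}
    with open(path, "rb") as f:
        while data := f.read(CHUNK_SIZE):
            digest = "sha256-" + hashlib.sha256(data).hexdigest()
            hashes.append(digest)
            blocks[digest] = data

    # One round trip: which of these chunks does the CAS already have?
    known = set(requests.post(f"{cas}/chunks/query", json=hashes).json()["existing"])

    for digest, data in blocks.items():
        if digest not in known:  # skip duplicates, transfer only the delta
            requests.put(f"{cas}/chunks/{digest}", data=data).raise_for_status()

    # The manifest is what turns a list of chunk hashes back into a file.
    requests.post(f"{cas}/manifests", json={"file": path, "chunks": hashes})
    return hashes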

This is the “smart writes” part of Hugging Face’s philosophy. Uploads are expensive. You want to do as little work as possible. So you analyze at the byte level, find what’s already there, and transfer only the delta.

But here’s the tricky part: CAS makes writes smart, but you can’t let it slow down reads.

Smart Writes, Dumb Reads: The Philosophy That Scales

Hugging Face has a mantra: “dumb reads, smart writes.” It sounds backwards until you think about the traffic patterns.

Reads are the hot path. When Llama 3 drops, millions of users download it. Those downloads need to be fast, simple, and reliable. You can’t afford complexity here. No computation, no reconstruction logic, no waiting. Just serve the bytes.

Writes are the cold path. Uploads happen way less often. Maybe thousands per day versus millions of downloads. You can spend CPU cycles here. Analyze chunks, check hashes, deduplicate, compress. It’s worth it because you only pay the cost once.

This is why Hugging Face’s read path is stupidly simple:
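
Roughly, and with hypothetical endpoint names, it’s just “look up the manifest, follow the URLs”:

# Dumb reads: ask the CAS where the chunks live, then stream them straight
# from the CDN in order. No reconstruction logic beyond concatenation.
# URLs and endpoint names are illustrative, not Hugging Face's actual API.
import requests

def download(cas: str, file_id: str, dest: str) -> None:
    # The CAS returns a manifest of CDN (CloudFront) URLs, one per chunk.
    manifest = requests.get(f"{cas}/manifests/{file_id}").json()
    with open(dest, "wb") as out:
        for url in manifest["chunk_urls"]:
            with requests.get(url, stream=True) as r:
                r.raise_for_status()
                for part in r.iter_content(chunk_size=1 << 20):
                    out.write(part)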

The CAS doesn’t serve the data. It just tells the client where to get it. S3 and CloudFront handle the actual transfer, which is what they’re optimized for. Fast, cached, globally distributed.

The write path is smart. It avoids redundant uploads, validates data, updates indexes. But it’s also slower and more complex. That’s fine because uploads are rare.

YouTube follows the same pattern. Video playback is dumb: fetch segments from CDN, play them. Video upload is smart: transcode, split, analyze, deduplicate, generate manifests. Dropbox too: downloads are just “here’s the file,” but uploads check for existing blocks and sync only the diff.

The key insight: optimize the common case (reads), tolerate complexity in the rare case (writes).

But there’s one more problem. Hugging Face’s users are global. 88 countries, uploading from everywhere. How do you make uploads fast when your CAS is in Virginia and the user is in Seoul?

Geographic Distribution: The 80/20 Rule Strikes Again

Hugging Face had a problem. Their CAS was in us-east-1. Great for American users, terrible for everyone else. Someone in Singapore uploading a 10GB dataset had to send it halfway around the world just to check which chunks already existed. That’s a round-trip penalty before the upload even starts.

The obvious solution: put CAS nodes everywhere. AWS has 34 regions. Scatter CAS servers across all of them, right?

Wrong. Infrastructure is expensive. Each CAS node needs compute, storage, and cross-region sync. Multiply that by 34 regions and you’re burning money. Plus, most regions barely see traffic.

So they looked at the data. 24 hours of uploads, 88 countries. Pareto principle in action [9]:

  • Top 7 countries = 80% of uploaded bytes

  • Top 20 countries = 95% of traffic

The United States alone was a third of all uploads. Europe (UK, Germany, Luxembourg) was another big chunk. Asia had concentrated traffic from Singapore, Hong Kong, Japan, South Korea.

Hugging Face made the call: three CAS regions.

  • us-east-1 (4 nodes): Serves North and South America

  • eu-west-3 (4 nodes): Serves Europe, Middle East, Africa

  • ap-southeast-1 (2 nodes): Serves Asia and Oceania

This covers 95% of traffic with just 10 nodes instead of 34. The 5% edge cases (random uploads from Iceland or Fiji) get routed to the nearest region. Slightly slower, but not enough to matter.

YouTube does the same thing. They don’t transcode videos in every Google data center. They pick strategic regions, distribute the work, and let edge cases eat a bit of latency [10]. Dropbox sync servers are clustered in high-traffic regions, not scattered everywhere. Similarly, Netflix’s Open Connect CDN strategically places cache servers in ISP data centers across 233 locations, with heavy concentration in North America and Europe [11].

The trade-off is latency. Moving from one CAS in Virginia to three regional CAS nodes means some users get a worse first-hop. Americans uploading to us-east-1? Fast. Europeans uploading to eu-west-3? Also fast. But someone in São Paulo might hit us-east-1 with a bit more latency than before.

Hugging Face’s estimates: average bandwidth drops from 48.5 Mbps to 42.5 Mbps (12% slower). But that’s the average. For most users, the experience is better. And the bandwidth loss is offset by deduplication savings. You’re uploading fewer bytes overall because you’re skipping redundant chunks.

The final architecture looks like this:
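
As a rough sketch, the routing decision might look like this; the continent-to-region table mirrors the split described above, but the mapping logic and hostnames are assumptions, not Hugging Face’s actual code:

# Writes go to the nearest of three CAS regions; reads always go through
# the CDN. Hostnames are placeholders.
CAS_REGIONS = {
    "NA": "us-east-1",       # North America
    "SA": "us-east-1",       # South America
    "EU": "eu-west-3",       # Europe
    "ME": "eu-west-3",       # Middle East
    "AF": "eu-west-3",       # Africa
    "AS": "ap-southeast-1",  # Asia
    "OC": "ap-southeast-1",  # Oceania
}

def upload_endpoint(continent_code: str) -> str:
    # Edge cases (unknown origin) fall back to us-east-1, the source of truth.
    region = CAS_REGIONS.get(continent_code, "us-east-1")
    return f"https://cas.{region}.example.com"

def download_endpoint() -> str:
    # Reads never touch the CAS region choice: CloudFront handles them.
    return "https://cdn.example.com"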

Reads go through CloudFront (400+ edge locations, blazing fast). Writes go through regional CAS (3 locations, smart deduplication). S3 is the source of truth, sitting in us-east-1 because cross-region replication is cheaper than multi-region writes.

This is the same pattern Dropbox uses: regional sync servers for uploads, CDN for downloads. YouTube: regional upload processing, CDN for playback. The pattern repeats because the physics don’t change.

Conclusion

Here’s what Hugging Face learned, and what Dropbox and YouTube learned before them:

  • Naive S3 uploads don’t scale: They’re fine for small files, but anything over a few gigabytes and you’re going to hit timeouts, size limits, and terrible UX.

  • CDNs solve downloads, not uploads: Edge caching is magic for reads, but writes still have to traverse the full distance. You need a smarter upload path.

  • Chunking is non-negotiable: Large files must be broken into pieces for resumability, parallelization, and progress tracking. The complexity is worth it.

  • Content-addressed storage eliminates waste: Fingerprinting chunks and deduplicating saves massive amounts of bandwidth and storage. If you’re moving terabytes, this isn’t optional.

  • Smart writes, dumb reads: Optimize the common case (downloads) for speed and simplicity. Tolerate complexity in the rare case (uploads) because you can afford to.

  • Geography matters, but you can’t be everywhere: Strategic regional placement based on traffic patterns beats scattering infrastructure everywhere.

The next time you upload a 50GB file and it actually works, remember: someone spent months building this architecture so you don’t have to think about it. That’s the whole point.

And if you’re building this yourself? Start simple. S3 + presigned URLs. Then add chunking when files get big. Then CDN when users go global. Then CAS when you’re drowning in duplicate data. Each layer solves a real problem. Don’t over-engineer early, but know where you’re headed.

Because the pattern is always the same: start naive, hit the wall, evolve. Rinse and repeat until your infrastructure can handle 130.8 TB in a day without breaking a sweat.

References

[1] AWS CloudFront Documentation. “CloudFront File Size Limits.” Amazon Web Services. https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/

[2] AWS API Gateway Documentation. “Amazon API Gateway quotas and important notes.” Amazon Web Services. https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html

[3] Medium. “How the Cloud and CDN Architecture Works for Netflix.” ResellerClub, 2023. https://teamresellerclub.medium.com/how-the-cloud-and-cdn-architecture-works-for-netflix-8f3d17906782

[4] Medium. “Inside Dropbox’s Brain: The Chunking Trick That Lets You Sync Gigabytes in Seconds.” Md Mazaharul Huq, 2024. https://jewelhuq.medium.com/inside-dropboxs-brain-the-chunking-trick-that-lets-you-sync-gigabytes-in-seconds-e62a866bb407

[5] Google Developers. “Resumable Uploads — YouTube Data API.” https://developers.google.com/youtube/v3/guides/using_resumable_upload_protocol

[6] Philipp Heckel. “Minimizing remote storage usage and synchronization time using deduplication and multichunking.” Philipp’s Tech Blog, 2013. https://blog.heckel.io/2013/05/20/minimizing-remote-storage-usage-and-synchronization-time-using-deduplication-and-multichunking-syncany-as-an-example/

[7] Andrew Tridgell and Paul Mackerras. “The rsync algorithm.” Technical Report TR-CS-96–05, Australian National University, 1996.

[8] Wikipedia. “Merkle tree.” https://en.wikipedia.org/wiki/Merkle_tree

[9] Wikipedia. “Pareto principle.” https://en.wikipedia.org/wiki/Pareto_principle

[10] YouTube Blog. “Reimagining video infrastructure to empower YouTube.” 2024. https://blog.youtube/inside-youtube/new-era-video-infrastructure/

[11] Data Center Frontier. “Mapping Netflix: Content Delivery Network Spans 233 Sites.” 2024. https://www.datacenterfrontier.com/cloud/article/11431108/mapping-netflix-content-delivery-network-spans-233-sites

Additional Reading

Research Papers:

  • Rabin, Michael O. “Fingerprinting by Random Polynomials.” Technical Report TR-15–81, Center for Research in Computing Technology, Harvard University, 1981.

  • Wen Xia et al. “FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.” USENIX Annual Technical Conference (ATC), 2016.

  • Broder, Andrei Z. “Some Applications of Rabin’s Fingerprinting Method.” Sequences II: Methods in Communications, Security, and Computer Science, 1993.

Engineering Blogs:

https://openconnect.netflix.com/
