EP1: How Netflix’s MediaFM Understands Movies Using AI

How Netflix’s MediaFM Understands Movies Using AI

Hello there, I love reading and learning in general.

But I know that a lot of people don’t.

Especially when the text is long and tech jargon-heavy.

But reading a company’s blog improves system design skills greatly.

I am starting a new series where I’ll provide simplified explanations of articles that I usually read.

Introducing CTRL+BREAK. Episode 1.

Today I’ll introduce you to this article that I recently read:

https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d

I’ll try to keep it small and simple yet comprehensive.

EP1: How Netflix’s MediaFM Understands Movies Using AI

What Was The Problem?

What do you think is the best way to suggest a movie to someone?

I would say suggesting something I have already watched, based on that person’s interest, he/she would like it.

Now, Netflix very well knows what you like to watch.

But its recommendation system hasn’t seen the movie.

What I mean is that a movie, series, or documentary is not just its metadata.

It’s the audio that gets you immersed in it, the dialogue, and the visuals.

But we as humans see it all. How would an ML algorithm watch a movie?

Some Terms I Didn’t Know (Maybe You Don’t Either)

BERT: A Transformer-based model that understands sequences (like sentences) using context.
MSM (Masked Shot Modeling): Like fill-in-the-blank, but for videos.
AdamW: A standard optimizer used to train deep learning models.
Muon: A newer optimizer that Netflix found performs better for this model.

Why Multimodal + Core Architecture?

Because one type of data is not enough.

Example:

A horror scene might look normal visually
But scary because of the background music

So Netflix combines 3 things:

Video → what’s happening
Audio → how it feels
Text → what’s being said

That’s the core of MediaFM.

What Do Shots (Input) Go Through?

Netflix refers to each movie or series episode as a title in this article.

Feeding a 2–3 hour title into a model isn’t a smart software decision.

That’s why the titles are broken down into shots.

A shot is a small segment of a title (between two cuts).

Here’s how the model works:

Break title → into shots
For each shot, extract:
- Video embedding
- Audio embedding
- Text embedding
Combine them
- They create a single vector (2304 dimensions)
Add sequence (VERY IMPORTANT)
- Shots are not random.
- They come in order: Shot1 → Shot2 → Shot3 → …
Pass it through the Transformer like BERT, but for scenes.
- It understands:
  - What came before
  - What comes next
  - How scenes connect
Result:
- Each shot now has context-aware meaning

Why Have Embeddings As Result And Not Generation?

Do you know the concepts of reusability and DRY (Don’t Repeat Yourself)?

This is exactly the kind of problem they exist to solve.

If we directly generated text as the output, it wouldn’t integrate well with existing systems, and using its results as a common ground would be hard.

That’s why the results are in embeddings (like we keep API data in JSON so that it’s easy to be used by multiple systems)

How Did They Test It?

They first froze the model, meaning they didn’t fine-tune the MediaFM layer.

This allows them to create an additional layer based on the use case and then test the model.

Because if we need to refine this model again and again for various use cases, we would be violating basic efficiency principles and wasting compute, storage, and energy.

Then they tested this on various existing business use-cases.

They used proper evaluation metrics like:

Average Precision (AP) → for classification (Are predictions correct?)
Kendall’s Tau → for ranking (Are rankings aligned with real user behavior?)

They didn’t just test in isolation.

They compared MediaFM against:

Internal model (SeqCLIP)
External APIs (Google, TwelveLabs)

MediaFM performed better across all tasks

Ablation Study (Most Interesting Part)

This is where it gets really insightful.

What is ablation?

Remove one component and see what breaks.

They tested:

Case 1:

Only combine audio + video + text
(no context)

Case 2:

Add Transformer (context understanding)

Result:

Just combining modalities → small improvement
Adding context → huge improvement

The biggest improvement came from:

Understanding context (sequence), not just combining data

Final Thought

If I simplify this entire article into one line:

Netflix is building a system that watches movies like a human — but at massive scale.