Feature image showing a collage of the story's stacked movie poster visualisation and sketches of the editorial illustrations.

How a bad Valentine’s Day turned into a Bollywood data project

The team behind our latest story on Bollywood shares how they researched the films’ dataset, used large language models to classify genres, and crafted the visual design for this bespoke longform story.

This story came from a place of personal disappointment.

For Valentine’s Day last year, I had a picture-perfect date in mind: a tender Bollywood rom-com, my partner and I sunk into plush theatre seats with popcorn tubs in our hands, and a comforting bowl of noodle soup after. It was Friday, the day for new movie releases. I presumed there would be a fresh film aimed squarely at couples like us.

We soon realised our only option was Chhaava. Three hours of war sequences, speeches about kingdoms and honour? It was the exact opposite of the mood for the day. We stayed home, and watched something forgettable on television instead. 

The last romantic Hindi film we enjoyed on the big screen was Rocky aur Rani Kii Prem Kahaani in 2023. Since then, every trailer had felt louder, charged with anger, violence and aggression.

A few Google searches later, it became clear we weren’t alone in this letdown. Cinema-goers shared the same worry. Where were this generation’s Raj and Simran?

Google search results showing Reddit links with titles such as "Is the Bollywood romantic/rom-com genre dead?", "Bollywood romcom genre is dead!!!", "Karan Johar Says Love Stories are 'Dead' in Bollywood."
Google search results showing articles such as "Is romance dead in Bollywood" from India Today and "Bollywood romances are dead. This generation doesn't have its own Raj, Rahul, or Simran" from ThePrint India.
Screenshots from the Google search.
Screenshot of a Reddit comment that says, So true. All we see is action or thrillers. WE NEED ROMCOMS. Bollywood used to thrive on them once upon a time. Bring the era back. Crying emoji.
Screenshot of a Reddit comment that says, i feel like it's more of a sincerity issue rather than a creativity one. romcoms we used to have had an earnestness with which the actors played their part. we (i) would fully be convinced that, yes, ofc you cross oceans and go through hell for your one true love. but you don't see that genuine spark anymore. feels very performative. Someone replied to this comment saying, THIS THIS THIS!!
Screenshot of a Reddit comment that says, It is not the romance genre that had a downfall. The entire Bollywood quality has dropped. Even though they are trying to make action movies, they are just remakes or sequels of cop universes or spyverses.
Screenshot of a Reddit comment that says, I miss the yrf movies of 2005-2015 when Imran Khan, Ayushman, Anushka, Parineeti, Alia were working in light hearted movies with amazing original soundtracks.
Screenshots from this Reddit thread.

PVR Cinemas filled this absence with re-releases of old love stories. My social media feed was full of reels with nostalgic viewers dancing in cinema halls to songs from romantic movies like Jab We Met and Yeh Jawaani Hai Deewani. 

I did the most obvious thing a disappointed data journalist does. To verify my hunch, I opened a spreadsheet. There was a long romantic drought to be measured. It was time to analyse my heartbreak. 

The project was simple, at least on paper. Take a dataset of popular films and their genres, plot the trend, and see whether romance has receded. If my hypothesis was correct, I would feel very sad, and a little vindicated.

But movies are not one-dimensional and genres can be slippery. A single film can have a love triangle, explosive car chases and intense family drama, all at once. Every attempt to force them into neat buckets threw off my method in new ways. For a long time before pitching the story, I was circling in a loop: intuition, analysis, confusion, and back to the beginning.

Step 1: Pick the hits

To populate the spreadsheet with films that drew crowds, made money, and shaped popular culture, I relied on Box Office India’s domestic revenue figures for each movie. I wanted to capture demand by measuring audience turnout, and supply by examining what producers were greenlighting. Footfall and filmographies became key proxies to capture both sides. Most of this step was straightforward, except for the pandemic years when cinema halls were shut, leaving gaps in the dataset. Instead of excluding those years entirely, I limited the sample to the 10 top-earning films each year. This captured a large share of annual box office revenue and allowed for consistent comparisons over time. The final dataset included 350 commercially important films released between 1990 and 2024, with revenue figures I could trust.
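As a rough illustration of the selection rule (not the code used for the story), picking the highest earners per year could be sketched like this. The function name, the field names and the sample rows are all hypothetical, and the revenue numbers are made up.

```python
from collections import defaultdict

def top_n_per_year(films, n=10):
    """Return the n highest-grossing films for each release year."""
    by_year = defaultdict(list)
    for film in films:
        by_year[film["year"]].append(film)

    selected = []
    for year in sorted(by_year):
        # Rank the year's films by revenue, highest first, and keep the top n.
        ranked = sorted(by_year[year], key=lambda f: f["revenue"], reverse=True)
        selected.extend(ranked[:n])
    return selected

# Made-up rows purely for illustration; not real box office figures.
sample = [
    {"title": "Film A", "year": 2020, "revenue": 120},
    {"title": "Film B", "year": 2020, "revenue": 300},
    {"title": "Film C", "year": 2020, "revenue": 80},
    {"title": "Film D", "year": 2021, "revenue": 50},
]
top_two = top_n_per_year(sample, n=2)
print([f["title"] for f in top_two])
```

With 35 release years (1990 to 2024) and n=10, a rule like this gives 350 films, as long as every year has at least 10 films with revenue figures.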

Step 2: Label the dominant flavour

How do I definitively know that a movie is in the romance genre? 

In India, we have a term called a “masala movie”, a literal “blend of spices” for genres. In a masala film, filmmakers throw in everything but the kitchen sink: a righteous hero single-handedly thrashes 10 armed goons, fights corruption, finds his long-lost twin brother, reconciles his family, falls in love between explosion scenes, and moves to Switzerland for a dance sequence. There is something for everyone in this cinematic curry. No one leaves hungry.

My first instinct was to trust ready-made genre tags crowdsourced from users worldwide on sites like IMDb and Rotten Tomatoes. Even if the labels were accurate, most films carried multiple tags, and counting every tag meant double-counting, triple-counting or worse. At that point, the chart stopped saying anything meaningful. This approach fell apart quickly.

A snippet of the story's early dataset featuring 6 films that carried 2 to 8 genre tags. One notable example is Tanhaji from 2020, which is tagged Action Epic, Epic, Historical Epic, Period Drama, Action, Biography, Drama, and History.
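The multiple-counting problem is easy to see in a toy example. The sketch below uses the eight tags listed for Tanhaji in the early dataset, plus one hypothetical film added only for contrast. It shows how the number of tag "votes" quickly outgrows the number of films.

```python
from collections import Counter

# Tanhaji's tags are the ones listed in the early dataset;
# the second film and its tags are hypothetical.
film_tags = {
    "Tanhaji": ["Action Epic", "Epic", "Historical Epic", "Period Drama",
                "Action", "Biography", "Drama", "History"],
    "Hypothetical Film": ["Romance", "Drama"],
}

# Count every tag on every film, which is what a naive genre chart would plot.
tag_counts = Counter(tag for tags in film_tags.values() for tag in tags)
total_tag_votes = sum(tag_counts.values())

print(f"{len(film_tags)} films, {total_tag_votes} tag votes")
print(tag_counts.most_common(3))
```

Two films produce ten tag votes here, so any chart built on raw tag counts says more about how generously a film was tagged than about what it is mainly about.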

A few days later, I tried a more statistical route. I scraped plot summaries from Wikipedia and fed them into topic models to uncover dominant themes. Think of topic models as a text detective. They look for words that frequently appear together in movie plots, such as “love”, “heartbreak” and “marriage” for romance, “explosions”, “chase” and “villain” for action, or “alien” and “spaceship” for sci-fi, and surface those clusters as the main themes. Along the way, I discovered theses and research papers devoted to predicting movie genres for Hollywood films, using sophisticated machine-learning algorithms like K-Nearest Neighbours, Latent Dirichlet Allocation, and more. The genre tangle was a problem many researchers had been chipping away at for years in pursuit of high accuracy.
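To make the "text detective" intuition concrete, here is a deliberately simplified stand-in. It is not a topic model, and it is neither Latent Dirichlet Allocation nor K-Nearest Neighbours. It simply counts hand-picked theme words in a plot, whereas a real topic model would learn such word clusters from the corpus itself. The word lists come from the examples above; the function name and the sample plot are hypothetical.

```python
import re
from collections import Counter

# Hand-picked word clusters from the examples in the text. A real topic model
# would discover clusters like these from the plots instead of being given them.
THEME_WORDS = {
    "romance": {"love", "heartbreak", "marriage"},
    "action": {"explosions", "chase", "villain"},
    "sci-fi": {"alien", "spaceship"},
}

def score_themes(plot):
    """Count how many words in the plot belong to each theme's word list."""
    words = re.findall(r"[a-z]+", plot.lower())
    scores = Counter()
    for theme, vocab in THEME_WORDS.items():
        scores[theme] = sum(1 for word in words if word in vocab)
    return scores

# Hypothetical one-line plot, written for this example.
plot = "A villain interrupts the marriage and a chase follows, but love wins after heartbreak."
scores = score_themes(plot)
dominant = max(scores, key=scores.get)
print(scores, dominant)
```

Even this toy version hints at the weakness: the result is only as good as the plot text it is given, and a long, detailed summary can bury the main theme under incidental words.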

I stood on the shoulders of giants, raiding Stack Overflow and open-source GitHub repositories to reproduce these methods for my sample. But they worked only as well as the quality and availability of Wikipedia plot summaries allowed. The results were noisy, often inaccurate, and were making me miss the forest for the trees. I abandoned this approach.

A snippet of the story's earlier dataset showing 3 movies and their plot column, which all contain extremely long plot summaries.

Even if the dish was high on masala, I had to be reductionist and isolate the strongest flavour. So instead of asking “What genres does this film contain?”, I asked a narrower question: “What is this movie primarily about?” This was an editorial judgment.

Four early tests of radar chart visualisation for 'Masala' movies, which contain several genres. The radar chart assigns one movie to one hexagon. Each point is one genre, and coloured lines will point to the genres that apply to that movie.
Early tests of a radar chart visualisation. This was scrapped later on...
A test of the radar chart visualisation portrayed as a masala box, which is a 6-compartment box used to contain spices. Instead of coloured lines pointing to a genre, that genre's "compartment" will be filled with spices.
...But this is what it would have looked like, if we went with a multi-genre approach and spiced it up with a masala box found in every Indian kitchen, as a visual metaphor!

Take the film Om Shanti Om, for example. It has reincarnation and ghosts (fantasy), daring fire rescue scenes (action), self-aware parody of the film industry (comedy), a revenge plot (thriller), over-the-top monologues and plenty of melodrama. But at its heart, it’s a love story between Om and Shanti, twice over in one movie. If you were describing it to a friend in one sentence, you’d probably start with the love story: "A man reborn to win back his lost love".

I was also curious: if romance was on the decline, what was on the rise? A new spice had entered the masala mix: “kesar”, Hindi for saffron, a colour associated with rising nationalism in India over the last decade. Kesariya tera ishq hai piya. Saffron was the new colour of love.

Step 3: Let an LLM take the first pass

How do I definitively make a large language model (LLM) know that a movie is in the romance genre?

A recent story in The Pudding used an LLM prompt to classify love songs. If a model could detect romantic intent in lyrics, perhaps it could do the same for movies. It was tempting to give it a try, and it did turn out to be a useful starting point.

With a well-written prompt, I could cover a lot of ground quickly. With a bad one, its guess was no better than rolling a die. The results depended heavily on how I framed the prompt and how much information was available online for it to crawl.

I struggled with pinning down a definition. What makes a film a Bollywood romance, or even trickier, a rom-com? I googled “How to determine whether a movie is or isn't a rom-com” and found an article with the exact same title on Entertainment Weekly, published in 2019. Before offering answers, it offered a disclaimer: “There's no single, universally accepted definition for what actually qualifies as a rom-com. [...] ‘I can't describe it, but I know it when I see it.’”

It sounded like a gut-check. I was ready to live with some level of subjectivity as long as the primary genre was in a reasonable range and didn’t completely flip when sorting movies into six broad genres.

After many rounds of refinement, I asked Claude to make binary judgments: Is this film primarily a romance, yes or no? Is nationalism a central theme, yes or no?
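A minimal sketch of what a binary-judgment setup might look like is shown below. The prompt wording, function names and parsing rule are assumptions for illustration, not the prompt used for the story, and the sketch leaves out the call to the model altogether. It only builds the question and reads a yes/no reply.

```python
def build_prompt(title, year, question):
    """Assemble a hypothetical yes/no prompt for one film."""
    return (
        f"Film: {title} ({year})\n"
        f"Question: {question}\n"
        "Answer with exactly one word: yes or no."
    )

def parse_yes_no(reply):
    """Map a model reply to True/False, or None if it is not a clean yes/no."""
    cleaned = reply.strip().strip(".!\"'").lower()
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None  # Anything else is left for a human to look at.

prompt = build_prompt(
    "Rocky aur Rani Kii Prem Kahaani", 2023, "Is this film primarily a romance?"
)
print(prompt)
print(parse_yes_no("Yes."), parse_yes_no(" no "), parse_yes_no("It depends"))
```

Forcing a one-word answer keeps each judgment easy to tabulate, and returning `None` for anything unclear avoids silently turning a hedged reply into a label.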

Look at Kabhi Khushi Kabhie Gham for example. Does the film contain a love story? Yes. But based on the plot, the movie is largely a joint family soap opera. As long as it didn't flip to action or thriller, the accuracy held for me. The whole was supposed to be greater than the sum of its parts.

A snippet of the story's dataset showing a sample of all the columns: Movie title, year, plot summary, key cast, themes, visual style, narrative structure, and primary genre.
A snippet of the story's dataset used for the data visualisation on the rise of nationalistic sentiments, showing columns rank, movie title, year, justification, primary genre, imdb link, and national sentiment.

Over the next few weeks, I experimented with multiple chart types using the genre-labelled dataset in R, but kept running into limits with story design and front-end development. It was at this point that I pitched the story to Kontinentalist.

A snippet of the pitch deck shown to Kontinentalist, showing a stacked bar chart of movie posters from 1970 to 2024.
A snippet of the pitch deck shown to Kontinentalist, showing a prototype of a scatterplot chart of movie footfall. Accompanying text says "OTT platforms have changed what audiences are willing to step out to watch. Competition from pan-Indian release of South Indian films."
A snippet of the pitch deck shown to Kontinentalist, showing a collage of recent action posters such as Pathaan, Animal, Jawan, and Adipurush. The collage is captioned "Bollywood today". All posters have a very gritty, blue-yellow colour scheme and angry men with weapons as the central characters.
A snippet of the pitch deck shown to Kontinentalist, showing a collage of older romance film posters such as Yeh Jawaani hai Deewani, Khoobsurat, and Ram-leela. The collage is captioned "Bollywood in the 1990s and 2000s". All the posters feature happy couples dancing, laughing, or embracing each other, and the colour scheme is warm and bright.
Some slides from the story's pitch deck.

Step 4: Having a human in the loop

The LLM was a fast solution. It was able to scan information across the Internet in seconds. It wasn’t always wise though. I kept revising the prompt trying to guide it. Some movies proved to be super tricky.

A man falls for a girl who is a robot. The model would see “robot” and label it sci-fi, missing that the emotional spine of the film is romance. Two people fall in love while robbing a bank. Does that make it a crime film or a love story? A film about three friends going on a road trip is full of drama about their friendship, even if they find love along the way. As long as it was classified as drama or romance, it fell within the guardrails I had set. Then, there were genre-benders, where love stories morphed into thriller or action films. For many such edge cases, I rewatched trailers to audit the verdict. A human had to remain in the loop to catch these nuances.
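One way to picture the guardrails is as a set of acceptable labels per film, with anything outside that set sent to a human. The sketch below is an assumption about how such a check could be written, not the workflow used for the story. The acceptable sets are a hypothetical encoding of the editorial judgment described above.

```python
def needs_review(model_label, acceptable_labels):
    """Flag a film when the model's label falls outside the editor's guardrails."""
    return model_label not in acceptable_labels

# Hypothetical cases based on the examples in the text.
cases = [
    {"film": "man falls for a robot", "model_label": "sci-fi",
     "acceptable": {"romance"}},
    {"film": "three friends on a road trip", "model_label": "drama",
     "acceptable": {"drama", "romance"}},
]

review_queue = [
    case["film"] for case in cases
    if needs_review(case["model_label"], case["acceptable"])
]
print(review_queue)
```

The check is only as good as the acceptable sets, and someone still has to define those by rewatching trailers or asking a person who knows the film.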

For older movies, where online information was scarce, I had a source better than the internet: my mother, who has never lost an “antakshari”, a game where players sing filmy songs starting with the last consonant of the previous song, in her life (so far).

“Ma, what is the main genre of Bobby, released in 1973?”

“It’s a classic love story. That song... Main Shaayar Toh Nahin is from that film.” And a scene-by-scene commentary followed.

“How do you know it’s not just a family drama between two people from different social backgrounds?”

“Lead pair, the lovers, Rishi Kapoor and Dimple Kapadia. They were projected as such and launched together in that film. The film is about them falling in love. The families, the class tension, all of that is peripheral.”

By the end, one thing stood out. Love was being swapped out of the spice box for artificial ingredients meant to increase shelf-life and alter what audiences crave. And that’s the story I set out to tell.
