How a bad Valentine’s Day turned into a Bollywood data project
The team behind our latest story on Bollywood shares how they researched the films’ dataset, used large language models to classify genres, and crafted the visual design for this bespoke longform story.
This story came from a place of personal disappointment.
For Valentine’s Day last year, I had a picture-perfect date in mind: a tender Bollywood rom-com, my partner and I sunk into plush theatre seats with popcorn tubs in our hands, and a comforting bowl of noodle soup after. It was Friday, the day for new movie releases. I presumed there would be a fresh film aimed squarely at couples like us.
We soon realised our only option was Chhaava. Three hours of war sequences, speeches about kingdoms and honour? It was the exact opposite of the mood for the day. We stayed home, and watched something forgettable on television instead.
The last romantic Hindi film we enjoyed on the big screen was Rocky aur Rani Kii Prem Kahaani in 2023. Since then, every trailer had felt louder, charged with anger, violence and aggression.
A few Google searches later, it became clear we weren’t alone in this letdown. Cinema-goers shared the same worry. Where were this generation’s Raj and Simran?






PVR Cinemas filled this absence with re-releases of old love stories. My social media feed was full of reels with nostalgic viewers dancing in cinema halls to songs from romantic movies like Jab We Met and Yeh Jawaani Hai Deewani.
I did the most obvious thing a disappointed data journalist does. To verify my hunch, I opened a spreadsheet. There was a long romantic drought to be measured. It was time to analyse my heartbreak.
The project was simple, at least on paper. Take a dataset of popular films and their genres, plot the trend, and see whether romance has receded. If my hypothesis was correct, I would feel very sad, and a little vindicated.
But movies are not one-dimensional and genres can be slippery. A single film can have a love triangle, explosive car chases and intense family drama, all at once. Every attempt to force them into neat buckets threw off my method in new ways. For a long time before pitching the story, I was circling in a loop: intuition, analysis, confusion, and back to the beginning.
Step 1: Pick the hits
To populate the spreadsheet with films that drew crowds, made money, and shaped popular culture, I relied on Box Office India’s domestic revenue figures for each movie. I wanted to capture demand by measuring audience turnout, and supply by examining what producers were greenlighting. Footfall and filmographies became key proxies to capture both sides. Most of this step was straightforward, except for the pandemic years when cinema halls were shut, leaving gaps in the dataset. Instead of excluding those years entirely, I limited the sample to the 10 top-earning films each year. This captured a large share of annual box office revenue and allowed for consistent comparisons over time. The final dataset included 350 commercially important films released between 1990 and 2024, with revenue figures I could trust.
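The selection rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the story: the field names (`title`, `year`, `revenue`) and the sample rows are made up, and real figures would come from Box Office India.

```python
from collections import defaultdict


def top_n_per_year(films, n=10):
    """Return the n highest-grossing films for each release year."""
    by_year = defaultdict(list)
    for film in films:
        by_year[film["year"]].append(film)

    selected = []
    for year in sorted(by_year):
        # Rank one year's films by revenue, highest first, and keep the top n.
        ranked = sorted(by_year[year], key=lambda f: f["revenue"], reverse=True)
        selected.extend(ranked[:n])
    return selected


# Made-up rows for illustration only; not real box office figures.
films = [
    {"title": "Film A", "year": 2020, "revenue": 50},
    {"title": "Film B", "year": 2020, "revenue": 120},
    {"title": "Film C", "year": 2020, "revenue": 80},
    {"title": "Film D", "year": 2021, "revenue": 30},
    {"title": "Film E", "year": 2021, "revenue": 200},
]

top_two = top_n_per_year(films, n=2)
```

Capping every year at the same number of films keeps thin years, such as the pandemic ones, comparable with full years, at the cost of ignoring everything below the cut-off.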
Step 2: Label the dominant flavour
How do I definitively know that a movie is in the romance genre?
In India, we have a term for this: the “masala movie”, literally a “blend of spices” of genres. In a masala film, filmmakers throw in everything but the kitchen sink: a righteous hero single-handedly thrashes 10 armed goons, fights corruption, finds his long-lost twin brother, reconciles his family, falls in love between explosions, and moves to Switzerland for a dance sequence. There is something for everyone in this cinematic curry. No one leaves hungry.
My first instinct was to trust ready-made genre tags crowdsourced from users worldwide on sites like IMDb and Rotten Tomatoes. This approach fell apart quickly. Even if the labels were accurate, most films carried multiple tags, and counting every tag meant counting the same film twice, three times or more. At that point, the chart stopped saying anything meaningful.
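A toy example shows how the multiple-counting problem creeps in. The films and tags below are hypothetical and are not taken from IMDb or Rotten Tomatoes.

```python
from collections import Counter

# Hypothetical tags for three films, for illustration only.
film_tags = {
    "Film A": ["Romance", "Action", "Drama"],
    "Film B": ["Action", "Thriller"],
    "Film C": ["Romance", "Comedy"],
}

# Count every tag on every film, the naive approach.
tag_counts = Counter(tag for tags in film_tags.values() for tag in tags)

n_films = len(film_tags)
total_tag_votes = sum(tag_counts.values())

# Share of films carrying each tag; these shares add up to more than 1
# because a single film contributes to several genres at once.
shares = {tag: count / n_films for tag, count in tag_counts.items()}
```

Here three films produce seven genre counts, so the genre shares no longer add up to a whole. One primary label per film avoids that.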

A few days later, I tried a more statistical route. I scraped plot summaries from Wikipedia and fed them into topic models to uncover dominant themes. Think of a topic model as a text detective. It looks for words that frequently appear together in movie plots, such as “love”, “heartbreak” and “marriage” for romance, “explosions”, “chase” and “villain” for action, or “alien” and “spaceship” for sci-fi, and surfaces genres as the main themes. Along the way, I discovered thesis work and research papers devoted to predicting movie genres for Hollywood films, using sophisticated machine-learning methods like K-Nearest Neighbours, Latent Dirichlet Allocation, and more. The genre tangle was a problem many researchers had been chipping away at for years in pursuit of higher accuracy.
I stood on the shoulders of giants, raiding Stack Overflow and open-source GitHub repositories to reproduce these methods for my sample. But the methods worked only as well as the quality and availability of Wikipedia plot summaries allowed. The results were noisy and often inaccurate, and they were making me miss the forest for the trees. I abandoned this approach.
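To give a feel for the word-grouping idea, here is a deliberately simplified stand-in. It is not a topic model: a real one learns its word groups from the data, while this sketch uses hand-picked keyword lists (an assumption, echoing the example words above) and simply counts matches in a plot summary.

```python
import re
from collections import Counter

# Hand-picked keyword lists, an assumption for illustration.
# A topic model would learn groupings like these from the plots themselves.
GENRE_KEYWORDS = {
    "romance": {"love", "heartbreak", "marriage"},
    "action": {"explosions", "chase", "villain"},
    "sci-fi": {"alien", "spaceship"},
}


def score_genres(plot):
    """Count how many words in the plot fall in each genre's keyword list."""
    words = re.findall(r"[a-z]+", plot.lower())
    scores = Counter()
    for word in words:
        for genre, keywords in GENRE_KEYWORDS.items():
            if word in keywords:
                scores[genre] += 1
    return scores


def dominant_genre(plot):
    """Return the top-scoring genre, or None when no keyword matches."""
    scores = score_genres(plot)
    if not scores:
        return None
    return scores.most_common(1)[0][0]


# A made-up plot summary, not scraped from Wikipedia.
plot = "After a car chase, the villain escapes, but love and marriage follow the heartbreak."
genre = dominant_genre(plot)
```

Even this toy version shows the weakness described above: the verdict depends entirely on which words happen to appear in the summary, so a thin or missing plot gives a poor label.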

Even if the dish was high on masala, I had to be reductionist and isolate the strongest flavour. So instead of asking “What genres does this film contain?”, I asked a narrower question: “What is this movie primarily about?” This was an editorial judgment.


Take the film Om Shanti Om, for example. It has reincarnation and ghosts (fantasy), daring fire rescue scenes (action), self-aware parody of the film industry (comedy), a revenge plot (thriller), over-the-top monologues and plenty of melodrama. But at its heart, it’s a love story between Om and Shanti, twice over in one movie. If you were describing it to a friend in one sentence, you’d probably start with the love story: "A man reborn to win back his lost love".
I was also curious: if romance was on the decline, what was on the rise? A new spice had entered the masala mix: “kesar”, Hindi for saffron, a colour associated with rising nationalism in India over the last decade. Kesariya tera ishq hai piya. Saffron was the new colour of love.
Step 3: Let an LLM take the first pass
How do I definitively make a large language model (LLM) know that a movie is in the romance genre?
A recent story in The Pudding used an LLM prompt to classify love songs. If a model could detect romantic intent in lyrics, perhaps it could do the same for movies. It was tempting to give it a try, and it did turn out to be a useful starting point.
With a well-written prompt, I could cover a lot of ground quickly. With a bad one, its guess was no better than rolling a dice. The results depended heavily on how I framed the prompt and how much information was available online for it to crawl.
I struggled with pinning down a definition. What makes a film a Bollywood romance, or even trickier, a rom-com? I googled “How to determine whether a movie is or isn't a rom-com” and found an article with the exact same title on Entertainment Weekly, published in 2019. Before offering answers, it offered a disclaimer: “There's no single, universally accepted definition for what actually qualifies as a rom-com. [...] ‘I can't describe it, but I know it when I see it.’”
It sounded like a gut-check. I was ready to live with some level of subjectivity as long as the primary genre was in a reasonable range and didn’t completely flip when sorting movies into six broad genres.
After many rounds of refinement, I asked Claude to make binary judgments: Is this film primarily a romance, yes or no? Is nationalism a central theme, yes or no?
Look at Kabhi Khushi Kabhie Gham, for example. Does the film contain a love story? Yes. But based on the plot, the movie is largely a joint family soap opera. As long as it didn't flip to action or thriller, the accuracy held for me. The whole was supposed to be greater than the sum of its parts.
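The binary framing can be sketched as a pair of helpers: one that composes a single yes/no question about a film, and one that reads the reply. This is a rough sketch, not the prompt used for the story. The wording, the function names and the dictionary keys are all hypothetical, and the call to the model itself is left out.

```python
# The two questions come from the text; the keys are hypothetical names.
QUESTIONS = {
    "is_romance": "Is this film primarily a romance?",
    "is_nationalist": "Is nationalism a central theme?",
}


def build_prompt(title, year, question):
    """Compose a single yes/no question about one film."""
    return (
        f"Film: {title} ({year})\n"
        f"Question: {question}\n"
        "Answer with exactly one word: yes or no."
    )


def parse_yes_no(reply):
    """Map a model reply to True or False, or None if it is ambiguous."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None  # ambiguous: leave it for a human rather than guess


prompt = build_prompt("Bobby", 1973, QUESTIONS["is_romance"])
```

Asking one narrow question at a time makes the replies easy to check, and anything that is not a clean yes or no can be set aside for manual review.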


Over the next few weeks, I experimented with multiple chart types using the genre-labelled dataset in R, but kept running into limits with story design and front-end development. It was at this point that I pitched the story to Kontinentalist.




Step 4: Having a human in the loop
The LLM was a fast solution. It was able to scan information across the internet in seconds. It wasn’t always wise, though. I kept revising the prompt, trying to guide it. Some movies proved to be super tricky.
A man falls for a girl who is a robot. The model would see “robot” and label it sci-fi, missing that the emotional spine of the film is romance. Two people fall in love while robbing a bank. Does that make it a crime film or a love story? A film about three friends on a road trip is full of drama about their friendship, even if they find love along the way. As long as it was classified as drama or romance, it fell within the guardrails I had set. Then, there were genre-benders, where love stories morphed into thriller or action films. For many such edge cases, I rewatched trailers to audit the verdict. A human had to remain in the loop to catch these nuances.
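One way to picture the guardrails is as a set of acceptable labels for each tricky film, with anything outside that set flagged for a rewatch. The sketch below is an assumption about how such a check could look, with placeholder film names and labels rather than the actual dataset.

```python
def audit(llm_labels, guardrails):
    """Return titles whose LLM label falls outside the accepted set."""
    return [
        title
        for title, label in llm_labels.items()
        if title in guardrails and label not in guardrails[title]
    ]


# Placeholder verdicts from the model, for illustration only.
llm_labels = {
    "Robot love story": "sci-fi",
    "Road trip film": "drama",
    "Bank heist romance": "romance",
}

# Hypothetical guardrails: the primary genres a human reviewer would accept.
guardrails = {
    "Robot love story": {"romance"},
    "Road trip film": {"drama", "romance"},
    "Bank heist romance": {"crime", "romance"},
}

to_rewatch = audit(llm_labels, guardrails)
```

Only the robot film lands on the rewatch list here, because "sci-fi" is outside its accepted set, while the other two labels fall within a reasonable range.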
For older movies, where online information was scarce, I had a source better than the internet: my mother, who has never lost a game of “antakshari”, in which players sing film songs starting with the last consonant of the previous song, in her life (so far).
“Ma, what is the main genre of Bobby, released in 1973?”
“It’s a classic love story. That song... Main Shaayar Toh Nahin is from that film.” And a scene-by-scene commentary followed.
“How do you know it’s not just a family drama between two people from different social backgrounds?”
“Lead pair, the lovers, Rishi Kapoor and Dimple Kapadia. They were projected as such and launched together in that film. The film is about them falling in love. The families, the class tension, all of that is peripheral.”
By the end, one thing stood out. Love was being swapped out of the spice box for artificial ingredients meant to increase shelf life and alter what audiences crave. And that’s the story I set out to tell.