
Tropesaurus



Trope-powered similar-finding work engine and more!

There's nothing as simultaneously great and sad as completing a work (movie, TV show, book, etc.) and looking for something similar, only to find arbitrary recommendations. If only there were something data-backed...

Well, that's where Tropesaurus comes in: it lets you find similar works based on the tropes they share, and more!

TV Tropes

TV Tropes is a great site that collects, lists, and describes (in a sometimes terrifying amount of detail) the tropes used in various works, along with which tropes each work contains.

Unfortunately there is no API for this great site, but thankfully it isn't using some complex frontend framework that makes it impossible to scrape, so scraping is exactly the plan.


Using the page-listing endpoint:

https://tvtropes.org/pmwiki/pagelist_having_pagetype_in_namespace.php?n=?t=work

to find all the works of a certain category, the only hard part was writing the script to scrape each individual work page for the list of tropes it contains. Because the pages are user-contributed, their format, while somewhat uniform, varies with how many tropes the work contains.
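As a rough illustration of that per-work scraping step, the sketch below pulls trope names out of a page's HTML. It assumes trope links point at the Main namespace (e.g. /pmwiki/pmwiki.php/Main/ChekhovsGun); the `extractTropes` name and the regex approach are hypothetical, not the script that was actually used, and a real scraper would probably want a proper HTML parser to cope with the varying page layouts.

```javascript
// Hypothetical helper: collect the unique trope names linked from a work page.
// Assumption: trope links point at the "Main" namespace, e.g.
// href="/pmwiki/pmwiki.php/Main/ChekhovsGun".
function extractTropes(html) {
  const tropeLink =
    /href=["'](?:https?:\/\/tvtropes\.org)?\/pmwiki\/pmwiki\.php\/Main\/([A-Za-z0-9]+)["']/g;
  const tropes = new Set(); // a Set drops repeated links to the same trope
  for (const match of html.matchAll(tropeLink)) {
    tropes.add(match[1]);
  }
  return [...tropes];
}

// Usage with a small hand-written snippet rather than a live page.
const sampleHtml = `
  <ul>
    <li><a href="/pmwiki/pmwiki.php/Main/ChekhovsGun">Chekhov's Gun</a>: ...</li>
    <li><a href="/pmwiki/pmwiki.php/Main/RedHerring">Red Herring</a>: ...</li>
    <li><a href="/pmwiki/pmwiki.php/Film/Inception">Inception</a></li>
    <li><a href="/pmwiki/pmwiki.php/Main/ChekhovsGun">Chekhov's Gun</a> again</li>
  </ul>`;
console.log(extractTropes(sampleHtml)); // [ 'ChekhovsGun', 'RedHerring' ]
```

The link to another work (Film/Inception) is ignored, and the repeated trope link is only counted once.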

Data

Considering the scale of the data, only what was needed was collected:

and for each trope:

Similarity

The primary feature of Tropesaurus is finding similar works based on the tropes they share. Thankfully this isn't too difficult with a MongoDB aggregation pipeline:

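// `work` is the work we want to find matches for.
const similar = await Work.aggregate([
  {
    // Skip the work itself and anything that shares none of its tropes.
    $match: {
      $and: [
        { _id: { $ne: work._id } },
        { tropes: { $in: work.tropes } }
      ]
    }
  },
  {
    // Score each candidate by the number of tropes it shares with `work`.
    $addFields: {
      similarity: { $size: { $setIntersection: [work.tropes, "$tropes"] } }
    }
  }
]).sort({ similarity: -1 }); // most similar first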

First we quickly exclude the original work and any work that shares none of its tropes. Then we add a field to each remaining work holding the size of the intersection between its tropes and those of the original work. Finally we sort the works by that size, largest first. We could also add a limit to the pipeline to return only the top 10 or so works, but given the static nature of the data this isn't needed; if queries turn out to be slow, the results can be pre-computed and stored in a separate collection or such.
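To make those steps concrete, here is the same calculation done in memory in plain JavaScript, including the optional top-N limit. This is only a sketch for illustration: `findSimilar` and the `sampleWorks` array are made up for the example, and the real query runs inside MongoDB as described.

```javascript
// In-memory equivalent of the pipeline, for illustration only.
// `works` is a hypothetical array of { _id, tropes } objects.
function findSimilar(work, works, limit = 10) {
  const wanted = new Set(work.tropes);
  return works
    // Exclude the original work (the $ne part of $match).
    .filter((other) => other._id !== work._id)
    .map((other) => ({
      ...other,
      // Count distinct shared tropes, mirroring $size of $setIntersection.
      similarity: new Set(other.tropes.filter((t) => wanted.has(t))).size,
    }))
    // Drop works with no tropes in common (the $in part of $match).
    .filter((other) => other.similarity > 0)
    // Most similar first, then keep only the top `limit` results.
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}

const sampleWorks = [
  { _id: 1, tropes: ["A", "B", "C"] },
  { _id: 2, tropes: ["A", "B", "D"] },
  { _id: 3, tropes: ["C"] },
  { _id: 4, tropes: ["E"] },
];
const ranked = findSimilar(sampleWorks[0], sampleWorks);
console.log(ranked.map((w) => [w._id, w.similarity])); // [ [ 2, 2 ], [ 3, 1 ] ]
```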

Extras

Of course, with all this data more could be done: listing a work and all its tropes, or a trope and all its works; showing tropes by how many works they appear in; searching for works that must have and can't have certain tropes; taking two works and displaying the tropes they share and don't share; and more!
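Two of those extras can be sketched in a few lines. The helpers below are hypothetical (the names `buildTropeFilter` and `compareWorks` are made up here) and assume, as in the similarity query, that each work stores its tropes in a `tropes` array.

```javascript
// Hypothetical helper: build a MongoDB filter for works that contain every
// trope in `mustHave` and none of the tropes in `cantHave`.
function buildTropeFilter(mustHave = [], cantHave = []) {
  const condition = {};
  // $all with an empty array matches nothing, so only add it when needed.
  if (mustHave.length > 0) condition.$all = mustHave;
  if (cantHave.length > 0) condition.$nin = cantHave;
  return Object.keys(condition).length > 0 ? { tropes: condition } : {};
}

// Hypothetical helper: split the tropes of two works into shared and unshared.
function compareWorks(a, b) {
  const inA = new Set(a.tropes);
  const inB = new Set(b.tropes);
  return {
    shared: [...inA].filter((t) => inB.has(t)),
    onlyFirst: [...inA].filter((t) => !inB.has(t)),
    onlySecond: [...inB].filter((t) => !inA.has(t)),
  };
}

console.log(buildTropeFilter(["RedHerring"], ["DownerEnding"]));
console.log(compareWorks({ tropes: ["A", "B"] }, { tropes: ["B", "C"] }));
```

The object returned by `buildTropeFilter` could then be passed to something like `Work.find(...)`, assuming the same `Work` model used for the similarity query.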

...and all that data we didn't collect that the user might want to see? It's all available, both by linking back to the original page for the work/trope and by embedding an iframe of the page in question.