Which Rapper Has The Largest Vocabulary in Hip Hop (an interesting study)

lazypakman Members Posts: 4,913 ✭✭✭✭✭
edited May 2014 in The Reason
http://rappers.mdaniels.com.s3-website-us-east-1.amazonaws.com/
Literary elites love to rep Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.

I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 words of lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.

35,000 words cover 3-5 studio albums and EPs. I included mixtapes if the artist was just short of 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, and the first 35,000 words of Moby-Dick).

I used a research methodology called token analysis to determine each artist’s vocabulary. Each distinct word form is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king ? ), featured vocalists, and repetitive choruses.
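For anyone trying to picture the counting step, here is a minimal Python sketch of it. It is not the author's actual code: the lowercasing, the regex tokenizer, and the function names `tokenize` and `unique_word_count` are all assumptions made for illustration.

```python
import re


def tokenize(lyrics: str) -> list[str]:
    """Split lyrics into lowercase word tokens (tokenizer rules are assumed)."""
    # Drop straight and curly apostrophes so that pimpin' and pimpin match.
    cleaned = re.sub(r"['’‘]", "", lyrics.lower())
    # Treat every run of letters or digits as one token.
    return re.findall(r"[a-z0-9]+", cleaned)


def unique_word_count(lyrics: str, limit: int = 35_000) -> int:
    """Count distinct word forms among the first `limit` tokens."""
    tokens = tokenize(lyrics)[:limit]
    return len(set(tokens))


if __name__ == "__main__":
    sample = "Pimps pimp pimping pimpin' pimpin"
    print(unique_word_count(sample))  # 4
```

Because no stemming is applied, every spelling counts as its own word, which is why transcription differences like shorty vs. shawty inflate the totals.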

It’s still directionally interesting. Of the 85 artists in the dataset, let’s take a look at who is on top.

1425ys4.jpg


#1

Aesop Rock


When I first published this analysis, I excluded Aesop Rock, figuring he was too obscure. The Reddit hip hop community was in an uproar, claiming Aesop would absolutely be #1. Sure enough, Aesop Rock is well above every artist in my dataset and I was obliged to add him to the chart. In fact, his data point is so far to the right that he should be off the chart (I'm lazy and didn't adjust the scale).



#2, #6, #7, #9, #20, and #23

wu-tang clan aint nothin ta ? wit


Wu-Tang Clan at #7 is impressive given that 10 members, with vastly different styles, are all contributing lyrics. Add the fact that the solo works of GZA, Ghostface, Raekwon, and Method Man are also in the top 25, notably GZA at #2. Perhaps their countless hours of studio time together (and RZA’s mentorship) exposed each rapper to the others’ vocabulary.

Let’s take a deeper look at Wu-Tang’s five studio albums to better understand each member’s contribution. Here's a breakdown of the number and percent of words used by each member.

wu-tang-breakdown.png

To understand each rapper's vocabulary (# of unique words) in Wu-Tang's first five albums, I chose a 3,500-word threshold so that each person was on an equal footing. That way, we could include GZA, but unfortunately had to exclude Ol' Dirty Bastard, Cappadonna, and Masta Killa, who have too few verses across Wu-Tang's corpus.
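The threshold rule above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `vocab_at_threshold`, the input shape (each member's words already tokenized, in release order), and the placeholder data are hypothetical, not taken from the study.

```python
def vocab_at_threshold(
    words_by_member: dict[str, list[str]], threshold: int = 3_500
) -> dict[str, int]:
    """Unique-word count per member, measured over the same number of words."""
    result: dict[str, int] = {}
    for member, words in words_by_member.items():
        if len(words) < threshold:
            # Too little material to compare fairly, so the member is excluded.
            continue
        # Only the first `threshold` words count, so everyone has equal footing.
        result[member] = len(set(words[:threshold]))
    return result


if __name__ == "__main__":
    # Placeholder data with a tiny threshold, purely for demonstration.
    demo = {
        "member_a": ["a", "b", "a", "c"],
        "member_b": ["x", "y"],
    }
    print(vocab_at_threshold(demo, threshold=3))  # {'member_a': 2}
```

Cutting every member off at the same word count matters because unique-word totals keep growing with sample size, so raw totals would favor whoever has the most verses.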

wu-tang-graph3.png

U-God and GZA clearly bolster the group’s average. Raekwon and Method Man’s contributions have a lower average than those of other members, but keep in mind that their data points would still exceed most artists in hip hop.

Comments

  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    edited May 2014
    #3 - 5

    Kool Keith, Canibus, Cunninlynguists


    Moving past Wu-Tang’s dominance, the next three artists are not as well-known. Of the three, Kool Keith has the most diverse vocabulary. For a taste of his work, check out his album with the largest vocab: Dr. Octagonecologyst. #4 and #5 are two relatively underground (yet accomplished) acts: Jamaican-born rapper Canibus and southern-based group CunninLynguists.


    #14 - 15

    Outkast and E-40


    Of course E-40 is in the top 20; he’s credited with inventing a lot of slang. Just a few terms he’s been responsible for: all good, pop ya collar, shizzle, and you feel me.

    At #14, Outkast’s deep vocabulary is definitely a function of their style: frequent use of portmanteaus (e.g., ATLiens, Stankonia), southern drawl (e.g., nahmsayin, ery’day), and made-up slang (e.g., flawsky-wawsky).

    As expected, other southern-based acts aren’t in Outkast’s league. Take a look at the regional break-out below:

    regional.png

    The South has the lowest average (4,268) and the East Coast the highest (4,804). In fact, only 4 of the 17 southern-based artists in the dataset are above average. My guess is that this is a function of crunk music's call-and-response style, resulting in more repetition of words.

    #26 and #33

    Busta Rhymes and Twista


    Since both rappers are known for their speed, it’s nice to see that their verses are just as lyrically diverse as their peers’.

    And skipping ahead to the bottom of the dataset...

    #67, #68, #71, and #72

    Snoop Dogg, 2Pac, Kanye West, and Lil Wayne


    Some of the biggest names in hip hop were in the bottom 20%. Let’s take another look at the data:

    dist2.png

    #85

    DMX

    At #85 and in last place: DMX. But this shouldn't undermine an artist whose raw energy and honesty were the most memorable qualities of his music.
  • Busta Carmichael
    Busta Carmichael Members, Moderators Posts: 13,161 Regulator
    I think this shows how much an image or crazy style can have an impact on a rapper.

    Because Masta Killa had more rhymes than ODB, but ODB is more memorable.
  • Karl.
    Karl. Members Posts: 8,015 ✭✭✭✭✭
    Thought of GZA when I saw the thread title.
  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    the east coast though.

    leagues above.
  • JerfyT
    JerfyT Members Posts: 1,163 ✭✭✭✭
    It was people that had at least 3 albums, I know that. That's why Biggie or Kendrick Lamar aren't included
  • 5 Grand
    5 Grand Members Posts: 12,869 ✭✭✭✭✭
    JerfyT wrote: »
    It was people that had at least 3 albums, I know that. That's why Biggie or Kendrick Lamar aren't included
    Really? Life After Death was a double album. Big didn't have 35,000 words? I guess it was because of the features. What if you included that Best of Biggie mixtape that Mr. Cee put out? I wonder if that would add up to 35,000?
  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    who would have put U-God above GZA though?

    with the amount of random gibberish that comes out of ghostface's mouth (just use one from 'nutmeg' as a reference) i don't even know what language he is speaking in at times.
  • 5 Grand
    5 Grand Members Posts: 12,869 ✭✭✭✭✭
    lazypakman wrote: »
    who would have put U-God above GZA though?

    with the amount of random gibberish that comes out of ghostface's mouth (just use one from 'nutmeg' as a reference) i don't even know what language he is speaking in at times.

    Yeah, I feel the same way about Kool Keith. His lyrics don't make any sense but he's in the top 10 and Pac is at #68. That just goes to show that the number of words you use doesn't really mean much, although sometimes Pac comes off a little simplistic.

    Another way to look at it is that Pac says more with fewer words.
  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    I'd also argue that with the majority of artists who start relatively young (like Pac and Wayne) it's harder to really gauge their vocabulary based only on their initial works, since their catalogue is so deep and they evolved much more over the course of their careers.

    they became more complex and lyrically aware as men and as artists as time went on.
  • 5 Grand
    5 Grand Members Posts: 12,869 ✭✭✭✭✭
    lazypakman wrote: »
    I'd also argue that with the majority of artists who start relatively young (like Pac and Wayne) it's harder to really gauge their vocabulary based only on their initial works, since their catalogue is so deep and they evolved much more over the course of their careers.

    they became more complex and lyrically aware as men and as artists as time went on.

    Good point. I know Chuck D and Guru had college degrees before they started rapping. I'd expect them to have a broader vocabulary than people who were teenagers when they started.
  • Valentinez A. Kaiser
    Valentinez A. Kaiser Members Posts: 9,028 ✭✭✭✭✭
    4,300 and Above (based on unique words used)

    1. Aesop Rock (7,392)
    2. GZA (6,426)
    3. Kool Keith (6,238)
    4. Canibus (5,991)
    5. Cunninlynguists (5,971)
    6. RZA (5,905)
    7. Wu-Tang Clan (5,895)
    8. The Roots (5,803)
    9. Ghostface Killah (5,774)
    10. Killah Priest (5,737)
    11. Blackalicious (5,480)
    12. Kool G Rap (5,394)
    13. Redman (5,331)
    14. Outkast (5,212)
    15. E-40 (5,207)
    16. MF Doom (5,204)
    17. Nas (5,096)
    18. Beastie Boys (5,090)
    19. Das EFX (5,005)
    20. Raekwon (5,001)
    21. Xzibit (4,982)
    22. Common (4,974)
    23. Method Man (4,951)
    24. De La Soul (4,933)
    25. Wale (4,896)
    26. Busta Rhymes (4,839)
    27. Tech N9ne (4,830)
    28. Goodie Mob (4,814)
    29. Ludacris (4,806)
    30. Gang Starr (4,794)
    31. Big Daddy Kane (4,768)
    32. Mobb Deep (4,756)
    33. LL Cool J (4,743)
    34. Twista (4,705)
    35. Talib Kweli (4,703)
    36. Brother Ali (4,700)
    37. Fat Joe (4,686)
    38. A Tribe Called Quest (4,635)
    39. Mos Def (4,630)
    40. Rakim (4,621)
    41. Brand Nubian (4,609)
    42. Tyga (4,601)
    43. KRS-One (4,585)
    44. Cypress Hill (4,568)
    45. Clipse (4,514)
    46. Jay-Z (4,506)
    47. Eminem (4,494)
    48. Public Enemy (4,481)
    49. Lil' Kim (4,474)
    50. Lupe Fiasco (4,439)
    51. Ice-T (4,431)
    52. Royce da 5'9" (4,430)
    53. Puff Daddy (4,429)
    54. The Game (4,416)
    55. Nelly (4,413)
    56. Cam'ron (4,406)
    57. Ice Cube (4,371)
    58. Biz Markie (4,313)
  • Mr. Rich Pryor
    Mr. Rich Pryor Members Posts: 1,233 ✭✭✭✭✭
    gza or rakim got dis
  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    i dare you to stay sleeping on Cunninlynguists.

    best hip hop group of the last decade.
  • RXMasked
    RXMasked Members Posts: 321 ✭✭✭
    of course the only flaw in it is they picked certain rappers. Like why they put Aesop Rock but no El-P, Camu Tao, Myka 9, Ras Kass, Aceyalone, Kool Moe Dee, Percee P, etc.? Why they even put groups/duos in there?
  • TWMorris
    TWMorris Members Posts: 24
    Wow this was interesting to read, I knew artists from Rhymesayers would have a high ranking in this but I didn't expect Aesop Rock to be #1
  • lazypakman
    lazypakman Members Posts: 4,913 ✭✭✭✭✭
    RXMasked wrote: »
    of course the only flaw in it is they picked certain rappers. Like why they put Aesop Rock but no El-P, Camu Tao, Myka 9, Ras Kass, Aceyalone, Kool Moe Dee, Percee P, etc.? Why they even put groups/duos in there?

    i'm just amazed at how the dude who did this managed to even put this system together. it could have been more thorough, yeah, but i'm sitting here looking at the term 'token analysis' trying to figure it out and my mind is blown.

  • 5 Grand
    5 Grand Members Posts: 12,869 ✭✭✭✭✭
    edited May 2014
    Lol @ Puff Daddy being listed. They should have just listed Sauce Money or Mad Skillz.

    I'm surprised Redman is so high. Who would have thought that Redman uses more words than Nas? And Redman is ten slots higher than Method Man.

    And Kool G Rap is 12 and Big Daddy Kane is 31

    I would have expected Rakim to be higher than 40 and certainly higher than The Beastie Boys.

    Nas is ranked higher than Jay. That doesn't surprise me.
  • sully
    sully Members, Writer Posts: 4,955 ✭✭✭✭✭
    edited May 2014
    i don't like this analysis. It's too crude.

    i think a better, more accurate method would've been to catalogue these rappers based on a certain amount of verses spit and not albums (b/c of the differential in songs and albums released) and b/c of the differential b/w solos versus groups.

    Break it down by number of verses and then break that down into average unique words per verse, w/ a minimum qualification of about 100 verses per artist, each no less than 16 bars. And maybe instead of using unique words, break it down by words repeated X amount of times (maybe 5 if there are 100 verses, to account for a margin of error).

    This way, you'd have a better understanding about who is using the same words every verse and would seem to imply who is talking about different things, since a low repeat rate would indicate a higher amount of topics or concepts touched upon per artist.
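A rough Python sketch of the per-verse idea proposed above, for illustration only. The function name `verse_stats`, the tokenized-verse input format, and the sample data are hypothetical; the 100-verse and 16-bar minimums are left out because they depend on how verses get selected.

```python
from collections import Counter


def verse_stats(verses: list[list[str]], min_repeats: int = 5) -> tuple[float, int]:
    """Return (average unique words per verse, count of heavily repeated words)."""
    if not verses:
        raise ValueError("need at least one verse")
    # Average number of distinct words within each individual verse.
    avg_unique = sum(len(set(verse)) for verse in verses) / len(verses)
    # Count how often each word appears across all verses combined.
    counts = Counter(word for verse in verses for word in verse)
    # Words used at least `min_repeats` times count as repeated.
    repeated = sum(1 for count in counts.values() if count >= min_repeats)
    return avg_unique, repeated


if __name__ == "__main__":
    # Made-up verses, purely for demonstration.
    verses = [
        ["money", "cars", "money"],
        ["money", "street"],
        ["cars", "money"],
    ]
    print(verse_stats(verses, min_repeats=3))  # (2.0, 1)
```

A low repeated-word count relative to the number of verses would be the signal described above: fewer recycled words from verse to verse.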
  • 5 Grand
    5 Grand Members Posts: 12,869 ✭✭✭✭✭
    Well it seems like by your 3rd or 4th album you should be using new words, not the same words you used on the previous albums. That's why Pac got listed so low.
  • BenjaminE
    BenjaminE Members Posts: 3,679 ✭✭✭✭✭
    RXMasked wrote: »
    of course the only flaw in it is they picked certain rappers. Like why they put Aesop Rock but no El-P, Camu Tao, Myka 9, Ras Kass, Aceyalone, Kool Moe Dee, Percee P, etc.? Why they even put groups/duos in there?

    I was also surprised there was no El-P when he included Aesop but couldn't be bothered to update the graph... the study seems to be misleading, especially when it's adjusted to have GZA near U-God... also, the size of a rapper's vocabulary doesn't mean the rapper is better... it just means they may not know how to get their point across in a concise manner...
  • sully
    sully Members, Writer Posts: 4,955 ✭✭✭✭✭
    i'm surprised Lupe is so low. thought he'd be at least near the middle
  • tompetrez3
    tompetrez3 Members Posts: 6,669 ✭✭✭✭✭
    Like I been telling you Jigga ? for years Jay-Z is a BELOW AVERAGE MC. I'm glad this has been proven mathematically. Everyone in my top 5 is above Jay-Z on this list including Lil Flip.