It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:
The Next Generation
| Name | Percentage of Lines |
|---|---|
| PICARD | 20.16 |
| RIKER | 11.64 |
| DATA | 10.1 |
| LAFORGE | 6.93 |
| WORF | 6.14 |
| TROI | 5.4 |
| CRUSHER | 5.11 |
| WESLEY | 2.32 |
DS9
| Name | Percentage of Lines |
|---|---|
| SISKO | 13.0 |
| KIRA | 8.23 |
| BASHIR | 7.79 |
| O’BRIEN | 7.31 |
| ODO | 7.26 |
| QUARK | 6.98 |
| DAX | 5.73 |
| WORF | 3.18 |
| JAKE | 2.31 |
| GARAK | 2.29 |
| NOG | 2.01 |
| ROM | 1.89 |
| DUKAT | 1.76 |
| EZRI | 1.53 |
Voyager
| Name | Percentage of Lines |
|---|---|
| JANEWAY | 17.7 |
| CHAKOTAY | 8.76 |
| EMH | 8.34 |
| PARIS | 7.63 |
| TUVOK | 6.9 |
| KIM | 6.57 |
| TORRES | 6.45 |
| SEVEN | 6.1 |
| NEELIX | 4.99 |
| KES | 2.06 |
Enterprise
| Name | Percentage of Lines |
|---|---|
| ARCHER | 24.52 |
| T’POL | 13.09 |
| TUCKER | 12.72 |
| REED | 7.34 |
| PHLOX | 5.71 |
| HOSHI | 4.63 |
| TRAVIS | 3.83 |
| SHRAN | 1.26 |
Discovery
Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.
| Name | Percentage of Lines |
|---|---|
| BURNHAM | 22.92 |
| SARU | 8.2 |
| BOOK | 6.21 |
| STAMETS | 5.44 |
| TILLY | 5.17 |
| LORCA | 4.99 |
| TARKA | 3.32 |
| TYLER | 3.18 |
| GEORGIOU | 2.96 |
| CULBER | 2.83 |
| RILLAK | 2.17 |
| DETMER | 1.97 |
| OWOSEKUN | 1.79 |
| ADIRA | 1.63 |
| COMPUTER | 1.61 |
| ZORA | 1.6 |
| VANCE | 1.07 |
| CORNWELL | 1.07 |
| SAREK | 1.06 |
| T’RINA | 1.02 |
If anyone is interested, here’s the (rather hurried) Python used:
```python
#!/usr/bin/env python
#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re

from collections import defaultdict
from pathlib import Path

EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                for line in episode.read_text().split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_percentages(self) -> dict[str, float]:
        total = sum(self.line_count.values())
        r = {}
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            if percentage > 1:
                r[k] = percentage
        return {k: v for k, v in reversed(sorted(r.items(), key=lambda _: _[1]))}

    def render(self) -> None:
        print(self.path.name)
        print("| Name             | Percentage of Lines |")
        print("| ---------------- | ------------------- |")
        for character, pct in self.as_percentages.items():
            print(f"| {character:16} | {pct} |")


if __name__ == "__main__":
    for series in (TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
```
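The script's header comment says a handful of files have to be converted to UTF-8 by hand before it will run. A possible alternative, sketched here, is to read each file as bytes and try a couple of encodings in turn. The `read_episode` helper and the `FALLBACK_ENCODINGS` list are hypothetical names, and Windows-1252 is only a guess at what those odd files actually use.

```python
from pathlib import Path

# Assumed fallback order: UTF-8 first, then Windows-1252, a common legacy
# encoding for older web pages. The real encoding of those files is a guess.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")


def read_episode(path: Path) -> str:
    """Return the file's text, trying each assumed encoding in turn."""
    raw = path.read_bytes()
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1")
```

With something like this in place, `episode.read_text()` in `collect` could become `read_episode(episode)`. Speaker names are plain ASCII, so a wrong guess at the encoding should only garble accented characters in the dialogue, not the counts.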
Corgana@startrek.website 1 year ago
Fascinating stuff! I love that you did this. I’m surprised Morn didn’t rank higher considering how chatty he is in every scene.
ericjmorey@discuss.online 1 year ago
Number of lines vs number of words spoken vs length of time speaking probably would have a lot of variation in results.
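Going from lines to words only needs a small change to the counting step. Here is a rough sketch that reuses the speaker pattern from the script in the post and adds a hypothetical `speech` capture group. The `count_lines_and_words` function is an illustrative name, and the sample dialogue is invented for the demo rather than taken from the transcripts. Speaking time can't be recovered from text-only transcripts, so it is left out.

```python
import re
from collections import defaultdict

# Same speaker pattern as the script in the post, plus a capture of the speech.
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): (?P<speech>.*)$")


def count_lines_and_words(lines):
    """Return {name: [line_count, word_count]} for transcript lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        if m := LINE_REGEX.match(line):
            totals[m.group("name")][0] += 1
            # Whitespace-separated tokens stand in for "words" here.
            totals[m.group("name")][1] += len(m.group("speech").split())
    return dict(totals)


# Invented sample lines, not real transcript data.
sample = [
    "PICARD: Make it so.",
    "DATA: I believe the anomaly is expanding at an accelerating rate, Captain.",
    "PICARD: Engage.",
]
print(count_lines_and_words(sample))
# → {'PICARD': [2, 4], 'DATA': [1, 11]}
```

In this toy sample PICARD has twice as many lines as DATA but fewer than half the words, which is the kind of divergence between the two metrics the comment points to. Two caveats: a speech that wraps onto following lines in the HTML would be undercounted, and any leftover markup in a line would be counted as words.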