It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:
The Next Generation
| Name | Percentage of Lines | 
|---|---|
| PICARD | 20.16 | 
| RIKER | 11.64 | 
| DATA | 10.1 | 
| LAFORGE | 6.93 | 
| WORF | 6.14 | 
| TROI | 5.4 | 
| CRUSHER | 5.11 | 
| WESLEY | 2.32 | 
DS9
| Name | Percentage of Lines | 
|---|---|
| SISKO | 13.0 | 
| KIRA | 8.23 | 
| BASHIR | 7.79 | 
| O’BRIEN | 7.31 | 
| ODO | 7.26 | 
| QUARK | 6.98 | 
| DAX | 5.73 | 
| WORF | 3.18 | 
| JAKE | 2.31 | 
| GARAK | 2.29 | 
| NOG | 2.01 | 
| ROM | 1.89 | 
| DUKAT | 1.76 | 
| EZRI | 1.53 | 
Voyager
| Name | Percentage of Lines | 
|---|---|
| JANEWAY | 17.7 | 
| CHAKOTAY | 8.76 | 
| EMH | 8.34 | 
| PARIS | 7.63 | 
| TUVOK | 6.9 | 
| KIM | 6.57 | 
| TORRES | 6.45 | 
| SEVEN | 6.1 | 
| NEELIX | 4.99 | 
| KES | 2.06 | 
Enterprise
| Name | Percentage of Lines | 
|---|---|
| ARCHER | 24.52 | 
| T’POL | 13.09 | 
| TUCKER | 12.72 | 
| REED | 7.34 | 
| PHLOX | 5.71 | 
| HOSHI | 4.63 | 
| TRAVIS | 3.83 | 
| SHRAN | 1.26 | 
Discovery
Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.
| Name | Percentage of Lines | 
|---|---|
| BURNHAM | 22.92 | 
| SARU | 8.2 | 
| BOOK | 6.21 | 
| STAMETS | 5.44 | 
| TILLY | 5.17 | 
| LORCA | 4.99 | 
| TARKA | 3.32 | 
| TYLER | 3.18 | 
| GEORGIOU | 2.96 | 
| CULBER | 2.83 | 
| RILLAK | 2.17 | 
| DETMER | 1.97 | 
| OWOSEKUN | 1.79 | 
| ADIRA | 1.63 | 
| COMPUTER | 1.61 | 
| ZORA | 1.6 | 
| VANCE | 1.07 | 
| CORNWELL | 1.07 | 
| SAREK | 1.06 | 
| T’RINA | 1.02 | 
If anyone is interested, here’s the (rather hurried) Python used:

    #!/usr/bin/env python
    #
    # This script assumes that you've already downloaded all the episode lines from
    # the fantastic chakoteya.net:
    #
    # wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
    # wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
    # wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
    # wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
    # wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
    #
    # Then you'll probably have to convert the following files to UTF-8 as they
    # differ from the rest:
    #
    # * Voyager/709.htm
    # * Voyager/515.htm
    # * Voyager/416.htm
    # * Enterprise/41.htm
    #

    import re

    from collections import defaultdict
    from pathlib import Path

    EPISODE_REGEX = re.compile(r"^\d+\.html?$")
    LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

    EPISODES = Path("www.chakoteya.net")
    DISCO = EPISODES / "STDisco17"
    ENT = EPISODES / "Enterprise"
    TNG = EPISODES / "NextGen"
    DS9 = EPISODES / "DS9"
    VOY = EPISODES / "Voyager"


    class CharacterLines:
        def __init__(self, path: Path) -> None:
            self.path = path
            self.line_count = defaultdict(int)

        def collect(self) -> None:
            for episode in self.path.glob("*.htm*"):
                if EPISODE_REGEX.match(episode.name):
                    for line in episode.read_text().split("\n"):
                        if m := LINE_REGEX.match(line):
                            self.line_count[m.group("name")] += 1

        @property
        def as_percentages(self) -> dict[str, float]:
            total = sum(self.line_count.values())
            r = {}
            for k, v in self.line_count.items():
                percentage = round(v * 100 / total, 2)
                if percentage > 1:
                    r[k] = percentage
            return {k: v for k, v in reversed(sorted(r.items(), key=lambda _: _[1]))}

        def render(self) -> None:
            print(self.path.name)
            print("| Name | Percentage of Lines |")
            print("| ---------------- | ------------------- |")
            for character, pct in self.as_percentages.items():
                print(f"| {character:16} | {pct} |")


    if __name__ == "__main__":
        for series in (TNG, DS9, VOY, ENT, DISCO):
            counter = CharacterLines(series)
            counter.collect()
            counter.render()

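
The script's header comment says a handful of files need converting to UTF-8 by hand. As a rough alternative, the sketch below tries UTF-8 first and falls back to another encoding when decoding fails. The helper names `decode_transcript` and `read_transcript` are made up for illustration, and the choice of Windows-1252 as the fallback is an assumption: the post does not say what encoding those files actually use.

```python
from pathlib import Path


def decode_transcript(raw: bytes) -> str:
    """Decode transcript bytes as UTF-8, falling back to cp1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the odd files are Windows-1252; the post does not say.
        # errors="replace" keeps going even if a byte is undefined in cp1252.
        return raw.decode("cp1252", errors="replace")


def read_transcript(path: Path) -> str:
    """Read one episode file without assuming its encoding up front."""
    return decode_transcript(path.read_bytes())
```

With something like this, `episode.read_text()` in `collect()` could be swapped for `read_transcript(episode)`, which would avoid editing the downloaded files.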
Corgana@startrek.website 1 year ago
Fascinating stuff, I love that you did this. I’m surprised Morn didn’t rank higher considering how chatty he is in every scene.
ericjmorey@discuss.online 1 year ago
Number of lines vs number of words spoken vs length of time speaking probably would have a lot of variation in results.
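
For what it's worth, the lines-versus-words comparison could be sketched roughly as below. It reuses the speaker pattern from the script in the post and adds a word tally alongside the line tally. The function name `count_lines_and_words`, the crude tag stripper, and the sample dialogue are all made up for illustration and are not taken from the transcripts. It also only counts words on the line that carries the speaker name, so a speech wrapped over several lines in the HTML would be undercounted. Speaking time is not something the transcripts record, so it is left out.

```python
import re
from collections import Counter

# Same speaker pattern as the script in the post, plus a group for the speech.
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): (?P<speech>.*)")
# Crude HTML tag stripper (an assumption about what the markup looks like).
TAG_REGEX = re.compile(r"<[^>]+>")


def count_lines_and_words(text: str) -> tuple[Counter, Counter]:
    """Return (lines per character, words per character) for one transcript."""
    lines, words = Counter(), Counter()
    for line in text.split("\n"):
        if m := LINE_REGEX.match(line):
            speech = TAG_REGEX.sub(" ", m.group("speech"))
            lines[m.group("name")] += 1
            words[m.group("name")] += len(speech.split())
    return lines, words


# Invented sample text, only to show the two tallies diverging.
sample = (
    "PICARD: Make it so.<br>\n"
    "DATA: Aye, sir. Course laid in, warp six.<br>\n"
    "PICARD: Engage."
)
lines, words = count_lines_and_words(sample)
print(lines)
print(words)
```

In this toy sample the character with more lines has fewer words, which is the kind of variation the comment is pointing at.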