The SSN is likely to appear in multiple tables, because they will reference a central table that ties it all together. This central table will likely only contain the SSN, the birth date (from what others have been saying), as well as potentially first and last name. In this table, the entries have to be unique.
But then you might have another table, like a table listing all the physical exams, which has the SSN to be able to link it to the person’s name, but ultimately just adds more information to this one person. It does not duplicate the SSN in a way that would be bad.
Comment on How does this pic show that Elon Musk doesnt know SQL?
halcyonloon@midwest.social 6 days agoTake this with a grain of salt as I’m not a dev, but do work on CMS reporting for a health information tech company. Depending on how the database is designed an SSN could appear in multiple tables.
In my experience reduplication happens as part of generating a report so that all relevant data related to a key and scope of the report can be gathered from the various tables.
Ephera@lemmy.ml 6 days ago
spankmonkey@lemmy.world 6 days ago
It is common for long lived databases with a rotating cast of devs to use different formats in different tables as well! One might have it as a string, one might have it as a number, and the other might have it with hyphens in the same database.
Hell, I work in a state agency and one of our older databases has a dozen tables with databases.
- One has the whole thing as a long int: 222333444
- One has the whole thing as a string: 2223334444 (which of course can’t be directly compared to the one that is a long int…)
- One has separate fields for area code and the rest with a hyphen: 222 and 333-4444
- One has the whole thing with parenthesis, a space, and a hyphen as a string: (222) 333-4444
The main reason for the discrepancy is not looking at what was used before or not understanding that they can always change the formatting when displayed so they don’t need to include the parenthesis or hyphens in the database itself.
pixxelkick@lemmy.world 6 days ago
Okay but if that happens, musk is right that that’s a bit of a denormalization issue that mayne needs resolving.
SSNs should be stored as strings without any hyphen or additional markup, nothing else.
- Storing as a number can cause issues if you ever wanna support trailing zeros
- any “styling” like hyphens should be handled by a consuming front end system, you want only the important data in the DB to maximize query times
It’s more likely though it’s just a composite key…
spankmonkey@lemmy.world 6 days ago
This is not what he is actively doing though. He isn’t trying to improve databases.
He is tearing down entire departments and agencies and using shit like this to justify it.
pixxelkick@lemmy.world 6 days ago
Sure but my point is, if it was the scenario you described, then Elon would be talking about the right kind of denormalization problem.
Denormalization due to multiple different storing their own copies if the same data, in different formats worse yet, would actually be the kind if problem he’s tweeting about.
As opposed to a composite key on one table which means him being an ultracrepidarian, as usual.
DahGangalang@infosec.pub 6 days ago
A given SSN appearing in multiple tables actually makes sense. To someone not familiar with SQL (i.e. at about my level of understanding), I could see that being misinterpreted as having multiple SSN repeated “in the database”.
Barbarian@sh.itjust.works 6 days ago
Theoretically, yeah, that’s one solution. The more reasonable thing to do would be to use the foreign key though. So, for example:
SSN Table ID | SSN | Other info
Other Table ID | SSN ID | Other info
When you want to connect them to have both sets of info, it’d be the following:
SELECT * FROM SSN_Table JOIN Other_Table ON SSN_Table.ID = Other_Table.SSN_ID
schteph@lemmy.world 6 days ago
This is true, but there are many instances where denormalization makes sense and is frequently used.
A common example is a table that is frequently read. Instead of going to the “central” table the data is denormalized for faster access. This is completely standard practice for every large system.
There’s nothing inherently wrong with it, but it can be easily misused. With SSN, I’d think the most stupid thing to do is to use it as the primary key. The second one would be to ignore the security risks that are ingrained in an SSN. The federal government, being large as it is, I’m sure has instances of both, however since Musky is using his possy of young, arrogant brogrammers, I’m positively certain they’re completely ignoring the security aspect.
syklemil@discuss.tchncs.de 6 days ago
To be a bit more generic here, when you’re at government scale you’re generally deep in trade-off territory. Time and space are frequently opposed values and you have to choose which one is most important, and consider the expenses of both.
E.g. caching is duplicating data to save time. Without it we’d have lower storage costs, but longer wait times and more network traffic.
DahGangalang@infosec.pub 6 days ago
Yeah, no one appreciates security.
I probably overused that saying to explain it: ‘if theres no break ins, why do we pay for security? Oh, there was a break in - what do we even pay security for?’
Barbarian@sh.itjust.works 6 days ago
Yeah, I work daily with a database with a very important non-ID field that is denormalized throughout most of the database. It’s not a common design pattern, but it is done from time to time.
DahGangalang@infosec.pub 6 days ago
Yeah, databases are complicated and make my head hurt. Glancing through resources from other comments, I’m realizing I know next to nothing about database optimization. Like, my gut reaction to your comment is that it seems like unnecessary overhead to have that data across two tables - but if one sub-dept didn’t need access to the raw SSN, but did need access to less personal data, j could see those stored in separate tables.
But anyway, you’re helping clear things up for me. I really appreciate the pseudo code level example.
Barbarian@sh.itjust.works 6 days ago
It’s necessary to split it out into different tables if you have a one-to-many relationships. Let’s say you have a list of driver licenses the person has had over the years, for example. Then you’d need the second table. So something like this:
SSN_TABLE
ID | SSN | Other info
Driver_License_Table
ID | SSN_ID | Issue_Date | Expiry_Date | Other_Info
Then you could do something like pull up a person’s latest driver’s license, or list all the ones they had, or pull up the SSN associated with that person.