The first thing to do when wrapping and glomming a database is to extract the endeme sets.
Part I – Extracting the endeme sets
It would be nice to build a tool that extracts candidate endeme sets from a database in preparation for glomming it. Then you could build an endematic wrapper around the entire database.
Process the items of each column to identify the endeme sets:
- use a Levenshtein distance matrix to coalesce similar spellings.
- use my ‘endemes for 5000 words’ document to coalesce similar meanings.
- more approaches based on data science.
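The Levenshtein step might be sketched as follows. This is only a minimal illustration, not a finished tool: the function names `levenshtein` and `coalesce`, the `max_distance` threshold of 2, and the greedy grouping against the first member of each group (instead of a full pairwise distance matrix) are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def coalesce(values, max_distance=2):
    """Greedily group values that are close to the first member of a group.

    max_distance is an assumed threshold; tune it per column.
    """
    groups = []
    for value in values:
        key = value.strip().lower()
        for group in groups:
            if levenshtein(key, group[0].strip().lower()) <= max_distance:
                group.append(value)
                break
        else:
            groups.append([value])  # nothing close enough, start a new group
    return groups


groups = coalesce(["Active", "active ", "Actve", "Closed", "closd"])
```

Each resulting group is a candidate for a single item in an endeme set. Short values need a smaller threshold, since a distance of 2 can merge unrelated three-letter codes.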
Lookup tables may be implicit or explicit.
Table Column analyses
Column analyses:
- general content/data columns
– mostly unique items columns [U] – no endeme sets extracted
- id/plumbing column
- context column
- lookup id column
- implicit lookup table columns
- Bits – several bit columns treated as one ‘column’; process the items of all columns of the same type (bit) together [B] – 1 set extracted
- Conflated – two or more endeme sets [C] – 2+ sets extracted
- Denormalized – zero normal form column [D] – 1 or 2 sets extracted
- Endematic – endematic range – 16-32 rows [E] – 1 set extracted
- Few – few rows – under 8 [F] – a fraction of a set extracted
- Many different items column [M] – 1 set extracted
- Freetext
– R 1+ freetext column with repeating words
– R 1+ freetext column with repeating concepts - look for concept sets; concepts have additional structure that endemes do not have
– concepts are generally characterized by two, three, or four endeme sets in a row or chain
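For the freetext columns with repeating words, a first pass could simply count how many rows each word appears in. The sketch below assumes a hypothetical helper `repeating_words` and an assumed cutoff `min_rows`; it does no stop-word filtering or stemming, and it does not attempt the harder problem of finding repeating concepts.

```python
import re
from collections import Counter


def repeating_words(rows, min_rows=2):
    """Return words that appear in at least min_rows different rows."""
    counts = Counter()
    for text in rows:
        # A set, so a word repeated inside one row is counted once.
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words)
    return {word: n for word, n in counts.items() if n >= min_rows}


notes = [
    "Customer called about late delivery",
    "Late delivery, refund issued",
    "Customer asked for refund",
]
candidates = repeating_words(notes)
```

The surviving words are only candidates; deciding which of them belong together in an endeme set is still a manual or semantic step.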
Lookup table analyses
Explicit lookup table analyses: process the lookup ids in a database to build endeme sets
- explicit lookup table content column
- Bits – several bit columns treated as one ‘column’; process the items of all columns of the same type (bit) together [B] – 1 set extracted
- Conflated – two or more endeme sets [C] – 2+ sets extracted
- Denormalized – zero normal form column [D] – 1 or 2 sets extracted
- Endematic – endematic range – 16-32 rows [E] – 1 set extracted
- Few – few rows – under 8 [F] – a fraction of a set extracted
- Many – many rows – over 64 [M] – 1 set extracted
- Unique – most rows are unique [U] – no endeme sets extracted
- where it’s used as context
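The row-count categories above could be applied mechanically, roughly like this. The sketch covers only the three bands that have explicit numbers in these notes (under 8, 16-32, over 64) and returns `None` for counts the notes leave unassigned. Bits, Conflated, Denormalized, and Unique need more than a count, so they are left out. The function name `classify_by_row_count`, the `status` table, and its `label` column are hypothetical, and `sqlite3` is used purely for illustration.

```python
import sqlite3
from typing import Optional


def classify_by_row_count(row_count: int) -> Optional[str]:
    """Map a distinct-row count to F, E, or M, or None if unassigned."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < 8:
        return "F"   # Few: a fraction of a set
    if 16 <= row_count <= 32:
        return "E"   # Endematic range: 1 set
    if row_count > 64:
        return "M"   # Many: 1 set
    return None      # 8-15 and 33-64 are not assigned a category here


# Hypothetical lookup table with 20 distinct labels.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO status (label) VALUES (?)",
    [(f"status {i}",) for i in range(20)],
)
(count,) = conn.execute("SELECT COUNT(DISTINCT label) FROM status").fetchone()
category = classify_by_row_count(count)
```

Here `category` comes out as `"E"`, since 20 falls in the endematic range.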
Junction table analyses
- there’s got to be something I can do with junction tables.
Views and Reports
Endemizing reports, views, and stored procedures that return data sets (‘report’ and ‘get’ sp’s)
- Context profiles?
- Endematic metadata – reports mostly show numbers, endemes can store relative values
- The row is the endeme item, the endeme indicates how it ‘compares’/‘relates’ to other items
Other stored procedures (and inline SQL code)
These may be used to identify contexts and relationships between tables and columns.
I wonder if I can do the same thing with code?
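For the stored procedure and inline SQL case, one rough way to surface relationships between tables and columns is to pull the equality conditions out of the joins. The sketch below is regex-based and deliberately naive: it assumes simple `FROM`/`JOIN` clauses with optional aliases, and it does not handle schema-qualified names, quoted identifiers, subqueries, or self-joins. The function name `join_relationships` and the sample query are hypothetical.

```python
import re

# Keywords that may follow a table name and must not be read as an alias.
_KEYWORDS = (r"(?:on|where|join|inner|left|right|full|cross|outer|"
             r"group|order|having|union)\b")
_TABLE_REF = re.compile(
    r"\b(?:from|join)\s+(\w+)(?:\s+(?:as\s+)?(?!" + _KEYWORDS + r")(\w+))?",
    re.IGNORECASE,
)
_EQUALITY = re.compile(r"\b(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)\b")


def join_relationships(sql):
    """Return sorted pairs of (table, column) linked by an equality test."""
    aliases = {}
    for table, alias in _TABLE_REF.findall(sql):
        aliases[table.lower()] = table
        if alias:
            aliases[alias.lower()] = table
    pairs = set()
    for left, left_col, right, right_col in _EQUALITY.findall(sql):
        left_table = aliases.get(left.lower())
        right_table = aliases.get(right.lower())
        # Skip unknown qualifiers and comparisons within one table.
        if left_table and right_table and left_table != right_table:
            pairs.add(tuple(sorted([(left_table, left_col),
                                    (right_table, right_col)])))
    return sorted(pairs)


sql = ("SELECT o.id, c.name FROM Orders o "
       "JOIN Customers AS c ON o.customer_id = c.id WHERE o.total > 10")
links = join_relationships(sql)
```

Each pair in `links` is a hint that the two columns share a context. A real SQL parser would be more reliable than regular expressions for anything beyond simple queries.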