Relationships are N-Squared: Provide for Shared Events #134

Merged
merged 11 commits into from
Jul 16, 2012

Conversation

stoicflame
Member

This is a comment on the current Record Model.

If the record being extracted holds a multi-role event (for example, say the record is a family in a census), then there typically are relationships between each pair of persons mentioned in the record. This is an N-squared situation. A family group of 5 persons requires potentially 20 relationships, a group of 6 persons potentially requires 30 relationships and so on. I believe that where these potential N-squared situations arise in a model, it behooves one to do something about it, especially if it is trivial to do so, as it is in this case.

I have always advocated not using the relationship concept directly at this point in a record level model. Instead, I believe it is better to think of the record object in this case as an "event" record that points to the persona records, with each pointer specifying the role the person has with respect to the event. In all important cases, knowing the roles that two different persons have with respect to a single event allows an easy inference of their relationship with respect to each other.

This converts the extraction of information from multi-role genealogical records from an O(N-squared) process into an O(N) process. And in the process it prepares the data for easier data processing.
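To make the arithmetic concrete, here is a minimal sketch of the two counts. The function names are invented for this illustration and are not part of any GEDCOM X API; the relationship count assumes one directed relationship per ordered pair of persons, which is what gives 20 for a group of 5 and 30 for a group of 6.

```python
def directed_relationship_count(n: int) -> int:
    """Ordered person-to-person pairs among n people: n * (n - 1)."""
    return n * (n - 1)


def role_pointer_count(n: int) -> int:
    """One role-typed pointer per person mentioned in the event record."""
    return n


for n in (5, 6, 10):
    print(n, directed_relationship_count(n), role_pointer_count(n))
```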

The Record object in the current model could contain these "role-typed" persona pointers, in which case the Record object describes the event in more actual fidelity, and the Relationship object is not needed.
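A rough sketch of what such "role-typed" persona pointers could look like follows. The class names, field names, and role labels are hypothetical and chosen only for illustration; they are not taken from the GEDCOM X Record Model.

```python
from dataclasses import dataclass, field


@dataclass
class RolePointer:
    persona_id: str  # reference to a persona record, not a copy of it
    role: str        # the part this persona plays in the event


@dataclass
class EventRecord:
    event_type: str
    participants: list = field(default_factory=list)

    def add(self, persona_id: str, role: str) -> None:
        """Attach one persona to the event with a stated role."""
        self.participants.append(RolePointer(persona_id, role))

    def personas_with_role(self, role: str) -> list:
        """Persona ids that play the given role, in record order."""
        return [p.persona_id for p in self.participants if p.role == role]


# A census family of five needs five pointers, not twenty relationships.
census = EventRecord("Census")
for persona_id, role in [("P1", "head"), ("P2", "wife"), ("P3", "son"),
                         ("P4", "daughter"), ("P5", "daughter")]:
    census.add(persona_id, role)
```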

I don't believe the Relationship object should be removed from the overall model however, since later, when conclusions are being built, it can be very important to be able to conclude that a particular relationship existed between two people (say they were cousins of an as yet undiscovered type).

@EssyGreen

I would prefer the Relationship object to be removed in favour of the Roles discussed here and in Issue #118 because I think this makes for a cleaner and simpler model. But I'd be happy to compromise providing we got the Roles.

@ttwetmore
Author

EssyGreen,

If you do have a chance to check the deadends model, you will see that relationships are accommodated. They are not separate objects, however. Each of the two persons in the relationship refers to the other with a relationship reference that also states the relationship role. A relationship is therefore just a two way pointer between two persons that has the info needed to understand the roles and any other facts that are INTRINSIC TO THE RELATIONSHIP but not intrinsic to the persons per se.
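The two-way pointer idea can be sketched roughly as below. This is not the actual DeadEnds format: the class names, the `link` helper, and the choice that `role` describes the person being pointed at are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipRef:
    other_id: str  # the person at the other end of the relationship
    role: str      # assumed here to be the role of that other person
    facts: dict = field(default_factory=dict)  # facts intrinsic to the relationship


@dataclass
class Person:
    id: str
    relationships: list = field(default_factory=list)


def link(a: Person, b: Person, role_of_a: str, role_of_b: str, **facts) -> None:
    """Create both halves of the pointer so the link is always two-way."""
    a.relationships.append(RelationshipRef(b.id, role_of_b, dict(facts)))
    b.relationships.append(RelationshipRef(a.id, role_of_a, dict(facts)))


p1, p2 = Person("P1"), Person("P2")
link(p1, p2, "husband", "wife", since="1880")
```

Creating both references in one call is one way to avoid a link that exists in only one direction.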

And if you know your GEDCOM by heart you'll recognize that the old ASSO tag has much the same semantic intent.

So sure, keep the relationship, but add those roles!

@EssyGreen

Hehe yes I do know my GEDCOM by heart :)

The problem I always had with ASSO tho' was that there was no guarantee that the link was two-way and since multiple ASSOs for the same person/person were completely possible there was no way to understand which linked to which. So from that point of view I would rather have an object that holds them all together in context ... and as we've agreed I think we both see that as an Event with multiple Roles.

@jralls
Contributor

jralls commented Feb 12, 2012

Yes, relationships are often N^2, but by-and-large most of them can be derived rather than stored. For example, sibling relationships can be derived from shared parental relationships.
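A minimal sketch of that derivation, assuming parent-child links are stored as `(parent, child)` pairs (the ids and the function name are made up for the example). Note that this version treats anyone sharing at least one parent as a sibling, so half-siblings are included.

```python
from collections import defaultdict


def derive_siblings(parent_child_pairs):
    """Return unordered sibling pairs implied by shared parents."""
    children_by_parent = defaultdict(set)
    for parent, child in parent_child_pairs:
        children_by_parent[parent].add(child)

    siblings = set()
    for children in children_by_parent.values():
        for a in children:
            for b in children:
                if a < b:  # keep each unordered pair once
                    siblings.add((a, b))
    return siblings


stored = [("P5", "P1"), ("P6", "P1"), ("P5", "P3"), ("P6", "P3")]
print(derive_siblings(stored))
```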

As for roles vs. relationships in the Record section, ISTM both are necessary. For example, in the US census from 1880, the relationship of each enumerated person to the nominal "head of household" is recorded. The role for each persona in that case is "enumerated" and the relationship should be recorded separately. Note that at the Record level, the implied relationships should not be recorded. Inferences belong in the Conclusions section.

@EssyGreen

I agree re the siblings, ancestors, descendants etc in the Conclusion model but the Record Model should enable all the relationships to be stored as they were recorded e.g. exactly like you've specified in your census example. The shared events with Roles seems to me to be the most effective way of doing this. see Issue #118

However, I also believe that in creating a Record the very act of breaking the original record down into these man-made objects is a process of Interpretation (one step further than a Transcription or Translation which is already a basic Interpretation) .... albeit one constrained by the bounds of the Record. Bearing this in mind I have no problem with Interpreting events which are not explicit in the original but are implied. For example, if an age and date of event are given I don't have a problem Interpreting an approximate date of birth; if a woman is recorded in the original as a widow I see no problem in Interpreting a previous marriage event to an unknown man etc. I would however love to have a field which enabled me to record whether the event/role was explicit or implicit but I can always build that into my own model.

As I mentioned elsewhere (can't find it right now) I also believe that there should be the ability for a Record to have multiple interpretations. For example, in the past a "step" relation was often used in the same way that the term "in-law" is used today - with totally different consequences to the relationships mentioned. The ability to document both interpretations and then follow through to a single conclusion would be advantageous in these situations.

@ttwetmore
Author

Relationships between persons (conventionally called relationships) and relationships between persons and events (conventionally called roles) are both key concepts in a genealogical data model. At times people will argue that you could get rid of one or the other, but you can’t; they are two different concepts, each required in different contexts.

One can then ask how relationships and roles are to be represented in a data model or a database. In a relational database implementation, both relationships and roles would almost certainly be implemented (“normalized”) as two simple tables. This doesn't imply that they would be separate objects in the model itself. In my models I have never "blessed" the relationship or the role concepts to be full-bodied objects; I have not made them "first class" citizens. I feel strongly about this, but others feel as strongly in the other way (Sarah argued earlier that relationships are first-class citizens).
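For illustration only, the "two simple tables" could look something like this in an in-memory SQLite database. The table and column names are assumptions made for the sketch, not a proposed schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE relationship (person_a TEXT, person_b TEXT, kind TEXT);
    CREATE TABLE role (person TEXT, event TEXT, role TEXT);
""")

# One marriage event with three participants, plus one person-to-person link.
db.executemany(
    "INSERT INTO role VALUES (?, ?, ?)",
    [("P1", "E1", "groom"), ("P2", "E1", "bride"), ("P3", "E1", "witness")],
)
db.execute("INSERT INTO relationship VALUES (?, ?, ?)", ("P1", "P2", "spouse"))

witnesses = [
    row[0]
    for row in db.execute(
        "SELECT person FROM role WHERE event = ? AND role = ?", ("E1", "witness")
    )
]
```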

Back up. Consider the three forms of genealogical data. First the model, which really isn't data. Second an archive or transmission file format for the data. The data in the file should correspond exactly with the concepts in the model (otherwise there is no point in having a model!). If the model has person objects then the file format should have person records, and so on. Then there is the database format; there need be no direct relationship between the data model/external format and the database format. If the database is a RDBMS then the external data (file format) would presumably be normalized into tables during import, and unnormalized back into "objects" on the way out. Personally I like what used to be called network databases, and which these days seem to be called document-based databases (e.g., Mongo) for genealogical databases. With these databases you can keep the same object structure in the database as you keep in the model and the file. This feels right to me so I have always done it that way.

My N-squared argument is deep in here. The Record Model does not include events. The result of this is that there is no role concept in the record model (obviously, because the concept of a role depends on the concept of an event). The result of this, in turn, is that all relationships (meaning the superclass concept; that is, relationships AND roles) must be treated as relationships between persons. This is inappropriate for some records, and I gave the example of a census record. As another example consider a marriage record that names two witnesses and the preacher-man. Witnesses especially are very important clues, so you want personas for them (I wouldn’t argue that you really need a persona for the preacher). Given you want the witnesses as personas, how, without roles, are you going to hook them up to any other records in the database in a meaningful way? You clearly need roles for this. The GEDCOMX Record Model does not have roles. It should. They convert an N-squared representation to a more meaningful N representation in many situations. They are Very Good Things.

@EssyGreen

Sarah argued earlier that relationships are first-class citizens

Did I? Oops I thought I was arguing for Roles! (Or maybe that was Roles as Relationships or Relationships as Roles - eek!). Your comments on the differences (and hence the need for both) are interesting ... but I'm still not sure both are necessary. If a person is a "wife" the implication is that there was a marriage event to a "husband"; conversely, if the event is a marriage with roles of "bride" and "groom" the implication is that their relationship was "wife" and "husband" after the event. I personally don't have a problem interpreting between the two but I find it easier to link multiples (more than 2 people) together via an event e.g. say a Census specifies a Householder, Wife, Sister and Son. With one Event and 4 Roles we've got the lot. If instead we have to define the relationships then we have the N-squared scenario or loss of data/context (e.g. Householder<->Wife, Householder<->Sister, Householder<->Son gives us 3 Relationships but tells us nothing about the relationship between "Sister" and "Son" without comparing/evaluating Householder<->Sister and Householder<->Son). For a Record where you want the data to be as transparent as possible I think this makes it less clear hence my argument for Roles.
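As a small sketch of the census example: four stored roles are enough to produce every pairing on demand, including Sister and Son, without storing any pairwise relationship. The persona ids and the helper name are invented for the illustration.

```python
from itertools import combinations

# One event, four roles: everything that is actually stored.
census_roles = {"P1": "Householder", "P2": "Wife", "P3": "Sister", "P4": "Son"}


def pairs_in_context(roles):
    """List every unordered pair of participants with the role each one holds."""
    return [((a, roles[a]), (b, roles[b])) for a, b in combinations(roles, 2)]


pairs = pairs_in_context(census_roles)
for left, right in pairs:
    print(left, right)
```

What the Sister and Son pairing actually means still has to be worked out from the two roles; the sketch only shows that the context for doing so is kept in one place.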

Conversely in the Conclusion Model I would argue for the simplicity of each Person having a (single) Mother and a (single) Father and anything else being up to the user to specify if they wish. See Issue #131.

Like I said before I think we're in agreement that we need Roles and Relationships - but not necessarily in the same context/model!?

@EssyGreen

Actually I think I'm going to scupper my own argument here ... If you wanted to draw a tree based on inheritance law (at least in the UK) then you would need to specify a date which the tree represented and show different parents if a child was adopted depending on that date. So if a child was born in 1900 with parents A+B but was adopted in 1910 to parents X+Y; the tree for 1900 would show a completely different set of ancestors to one which represented the situation in 1910 or after. This is something I'd love to see done but not sure I would personally have the time to develop it. Anyway, in order to do it the application would have to store the parents as Roles with the event (much like the FAMC links in old GEDCOM BIRThs, ADOPtions etc).
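A sketch of how parents stored as Roles on dated events could answer "who were the parents as of year Y". The rule used here, that the most recent birth or adoption event on or before the chosen year wins, is a simplifying assumption for the example and not a statement of UK inheritance law; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParentEvent:
    kind: str       # "Birth" or "Adoption"
    year: int
    child: str
    parents: tuple  # persona ids holding the parent roles in this event


events = [
    ParentEvent("Birth", 1900, "C", ("A", "B")),
    ParentEvent("Adoption", 1910, "C", ("X", "Y")),
]


def parents_as_of(child, year, events):
    """Parents from the latest applicable event on or before the given year."""
    applicable = [e for e in events if e.child == child and e.year <= year]
    if not applicable:
        return None
    return max(applicable, key=lambda e: e.year).parents
```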

For this reason I would argue in favour of Roles in both Record and Conclusion models.

I would however, maintain my argument that there cannot be multiple births (or deaths) in the Conclusion Model.

@jralls
Contributor

jralls commented Feb 12, 2012

I agree with Tom about the need to have separate objects/fields/elements/whatever for roles and relationships, and disagree strongly with Sarah that the relationship to the "head of household" is a role in a census event.

Tom, the record model does include events, they're just bundled up in Fact. If you look at the enum FactType in gedcomx-types/.../FactType.java, you'll see that most of the values are events with only a few characteristics mixed in. There's a big argument about doing this in #84 and #85.

@ttwetmore
Author

My apologies, Sarah, for misunderstanding your points.

One comment about your point of not having multiple birth and death events in the Conclusion Model person record.

In the DeadEnds model, which has what is called a multi-level or an n-tier person model, person records (for the same individual) can be constructed into trees, with the higher nodes representing conclusions based on evidence and conclusions from the lower nodes, and the leaves being what GEDCOMX calls personas. All the nodes are simply examples of person records. They are brought together into trees because the user decides they all represent records that mention the same real individual. So the higher level persons inherit the birth and death events of the person nodes below them. This makes sense since the tree represents the bringing together of all the information the user believes applies to individual individuals (smily). But we know that there is a high likelihood of errors and inconsistency in the various raw records of the events that we find during research.

When you get to the “top of the person node tree,” which is what in the GEDCOMX terminology is a person record in the Conclusion Model, you have a record that may have many sub-person and persona records hanging below it. If those lower down person records have a fully consistent story to tell about birth and/or death, then no birth or death info need be added to the top person. The top person “inherits” the information from below. However, if there are inconsistencies, then the user of this wonder system will have to specify which of the birth/death events from below should apply to the individual as a whole. Of course this doesn’t have to be the same as any of the exact events from below, but can be an event constructed from info from any of the events below. The choosing of an event, or the construction of a composite event, is a “conclusion” made by the user, so must be justified in the root, individual level person. And of course this conclusion making can actually be made at any level of the tree as it grows.

Imagine the “two-tier” model of New Family Search trees. In that system, persona records (they are never officially called that there, though that’s what they are) are gathered from many sources, including users. Then users of the system can join these personas together into individuals, and they can specify which of the facts from which of the personas they want to be displayed at the individual level. Other users can later rearrange the personas into different sets of individuals, and can change which events from which of the personas to display for the individuals.

So the New Family Search system is a two-tier person system. I believe it more logically should be changed to an n-tier system. As a two-tier system it is logical to talk about a Record Model and a Conclusion Model. But once you “graduate” to an n-tier system it becomes impossible to maintain the illusion that there should be two fixed layers in a genealogical data model.

@EssyGreen

No apologies needed @ttwetmore - my fault for not being clear!

When you get to the “top of the person node tree,” which is what in the GEDCOMX terminology is a person record in the Conclusion Model, you have a record that may have many sub-person and persona records hanging below it.

I don't agree that a Person has "sub-Person(a)s" - they may reference a bunch of Personas but I don't believe that a concluded Person is or should be a component of a Person represented elsewhere.

When different "Personas" are referenced and linked to a single "Person" I don't believe that the Concluded Person is the same instance of the Persona (since it is being interpreted in a different context - maybe in a different hypothesis) nor do I believe that the Concluded Person should "inherit from" any of the instances of the Personas. However, the Facts, Roles, Relationships or whatever recorded as linked to the Persona are available (indirectly) for the researcher to assess. For example, say I have 2 records P1 and P2 with conflicting birth events 1850 Bedminster and 1854 Bishport ... I create a P3 with a new concluded birth event say Abt 1852 Bedminster which takes into account (and cites) P1 and P2. The new event is not an instance of nor inherited from P1 or P2 but belongs to P3. However, because P3 links to P1 and P2 (taking aside negative complexities for a moment) then if the application wishes to represent the un-concluded data they can simply "show through" the data from the cross-referenced P1 and P2. But from a data model point of view it would be incorrect to hold the events of P1 and P2 against P3.
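The P1/P2/P3 example can be sketched like this. The class and field names are hypothetical; the point is only that P3 owns one concluded birth that cites P1 and P2, while the recorded births stay on P1 and P2 and are merely "shown through" by reference.

```python
from dataclasses import dataclass, field


@dataclass
class Birth:
    date: str
    place: str
    cites: list = field(default_factory=list)  # ids of records this event relies on


@dataclass
class PersonRecord:
    id: str
    births: list = field(default_factory=list)    # events held against this record
    evidence: list = field(default_factory=list)  # cross-referenced records


p1 = PersonRecord("P1", [Birth("1850", "Bedminster")])
p2 = PersonRecord("P2", [Birth("1854", "Bishport")])
p3 = PersonRecord(
    "P3",
    [Birth("Abt 1852", "Bedminster", cites=["P1", "P2"])],
    evidence=[p1, p2],
)


def show_through(person):
    """Un-concluded births reachable via cross-references; nothing is copied."""
    return [(src.id, b.date, b.place) for src in person.evidence for b in src.births]
```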

@EssyGreen

@jralls - going back to the Roles vs Relationships debate, I think this one is interesting so can we thrash it out a bit? I think ...

  • a Role is the part a single person plays in a single event or relationship
  • a Relationship is a state of "connection" between two or more people which may be bound by time (and possibly other factors I can't think of!)

Because a Relationship can be bound by time (and possibly place e.g. having a holiday romance maybe?) it is simply an Event. For example, a husband/wife relationship = a marriage event (NB - not the same as the wedding or the marrying event!)

@EssyGreen

This probably explains why my Interpretations are wider than the event being recorded e.g. the Census event for Date D1 above with P1 (householder), P2 (wife), P3 (sister) and P4 (son), I would interpret as:

E1: Residence with Roles P1 (householder), P2, P3, P4 on D1
E2: Marriage: P1 (husband), P2 (wife)
E3: Birth: P4 (child), P1 (father), P2 (mother)
E4: Birth: P1 (child), P5 (father), P6 (mother)
E5: Birth: P3 (child), P5 (father), P6 (mother)
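A sketch of how such an interpretation could be generated mechanically is shown below. The rules, the `explicit` flag, and the placeholder ids P5 and P6 for the householder's unnamed parents are assumptions made for this example; whether the parent assignments are justified is a matter of researcher judgement, which is why each inferred event is marked as not explicit.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str
    roles: dict     # persona id -> role in this event
    explicit: bool  # True if stated in the record, False if inferred
    date: str = ""


def interpret_census(date, household):
    """household: list of (persona_id, relationship_to_head) in record order."""
    head = next(pid for pid, rel in household if rel == "householder")
    wife = next((pid for pid, rel in household if rel == "wife"), None)

    # E1: the only event the record states directly.
    events = [Event("Residence", dict(household), explicit=True, date=date)]

    if wife:
        events.append(Event("Marriage", {head: "husband", wife: "wife"}, explicit=False))

    for pid, rel in household:
        if rel in ("son", "daughter"):
            roles = {pid: "child", head: "father"}
            if wife:
                roles[wife] = "mother"  # an inference, not a recorded fact
            events.append(Event("Birth", roles, explicit=False))

    siblings = [pid for pid, rel in household if rel in ("sister", "brother")]
    if siblings:
        father, mother = "P5", "P6"  # placeholder personas for unnamed parents
        for pid in [head] + siblings:
            roles = {pid: "child", father: "father", mother: "mother"}
            events.append(Event("Birth", roles, explicit=False))
    return events


events = interpret_census(
    "D1", [("P1", "householder"), ("P2", "wife"), ("P3", "sister"), ("P4", "son")]
)
```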

@ttwetmore
Author

I hope I don't misinterpret. The person trees I mention are only built up via references. No record is ever a component of another. Trees are built, trimmed, and rearranged by manipulating pointers or ids. Compose does not imply contain.

Sounds like you might prefer 2-tier systems over n-tier ones, but I'm not sure. The key concept in a 2-tier system is that individuals are composed (which doesn't imply containment) from all the persona records that hold evidence about them. It is the genealogist's job to figure that out and establish the proper relationships between the persons and the personas. In most current systems there are no persona records; genealogists just add facts to conclusion person records, with each fact referencing the specific item of evidence/source it was taken from. Woe be to the users when they discover they added facts from another real person to one of their conclusion persons.

New Family Search is a 2-tier system so it seems natural that GEDCOMX would therefore be 2-tier. However, I haven't figured out in the GEDCOMX model how the two models connect. How does a genealogist, when putting a person record into the conclusion model, refer to the persona records from the record model that justify the person record? The answer is probably very simple; I just don't see it yet.

Obviously I believe the n-tier system is best. Much of this belief comes from software I have written in the past that automatically compares records and then automatically links them together. Algorithms like this can be very complex and have to be organized into phases that use various statistical techniques. Because there are phases, the linking of records occurs in stages where each stage conceptually takes some persona records and adds them to some of the person records that are being built up by the process. By using an n-tier system one can fully track the operation of the phases, and therefore reconstruct the full history, the full set of operations that linked the persons together. This is important -- with 2-tier systems you lose the history of conclusion making, in fact you lose the details of ALL your conclusions -- with n-tier systems you maintain the full history of conclusion making, and that history is fully reversible, that is, you can cleanly "unmake" any of your conclusions. This one property, for me, demands the n-tier system. In one application I wrote the linking as a 2-tier system first, and later was forced to convert it to an n-tier system in order to be able to present and make any sense of what the linking really means. What started as simply a "debugging" aid led me to understand the necessary properties of an n-tier system in any application where the number of records reaches into the 100s of 1,000s or beyond. The applications I worked on had record counts in the 100s of 1,000,000s.
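The reversibility property can be illustrated with a toy structure in which every link creates a new higher-level node that only points at existing nodes. This is a sketch under that assumption, not the linking software described above; the class name, method names, and reasons are invented.

```python
class LinkHistory:
    """Person nodes linked by reference only; every link can be undone."""

    def __init__(self):
        self.children = {}  # node id -> ids of the nodes it was built from
        self.reason = {}    # conclusion node id -> why the link was made
        self._counter = 0

    def add_persona(self, persona_id):
        self.children[persona_id] = []  # leaves have no children

    def link(self, node_ids, reason):
        """Create a higher-level person node that points at existing nodes."""
        self._counter += 1
        node = f"N{self._counter}"
        self.children[node] = list(node_ids)
        self.reason[node] = reason
        return node

    def unlink(self, node):
        """Undo one conclusion; the nodes below it are left untouched."""
        if node not in self.reason:
            raise ValueError("only conclusion nodes can be unlinked")
        if any(node in kids for kids in self.children.values()):
            raise ValueError("undo the higher-level conclusion first")
        del self.reason[node]
        return self.children.pop(node)

    def personas(self, node):
        """All leaf personas reachable from a node."""
        kids = self.children[node]
        if not kids:
            return [node]
        return [leaf for kid in kids for leaf in self.personas(kid)]


history = LinkHistory()
for persona in ("a", "b", "c"):
    history.add_persona(persona)
n1 = history.link(["a", "b"], "same name and age")
n2 = history.link([n1, "c"], "same spouse")
```

Because the first link survives as its own node, undoing the second conclusion restores exactly the state that existed before it was made.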

When I say inherit I mean the conclusion persons inherit the properties of their persona records (limiting thinking to a 2-tier system). All I mean is that there is no need to copy any properties/facts from the persona records, into their conclusion record, if you believe the information in the persona is correct and there is no conflicting information in any of the other persona records. If you believe, by whatever means, that either the information in the personas is incorrect, or there is conflicting information in the personas, then you must add information at the person level to resolve the issue. And of course, the adding of this information has to be seen and treated as making a conclusion.
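In code, "inherit" in this sense could be as simple as a lookup that falls through to the personas when they agree and insists on an explicit conclusion when they do not. The function name and the dict-based records are assumptions for the sketch.

```python
def resolve(fact, person_facts, persona_facts):
    """Return the value of a fact for a conclusion person.

    person_facts: facts added at the person level (explicit conclusions).
    persona_facts: one dict of facts per linked persona record.
    """
    if fact in person_facts:
        return person_facts[fact]  # a conclusion overrides what lies below

    values = {p[fact] for p in persona_facts if fact in p}
    if len(values) == 1:
        return next(iter(values))  # consistent evidence: nothing is copied
    if not values:
        return None                # no persona mentions this fact
    raise ValueError(
        f"conflicting {fact} values {sorted(values)}: a conclusion is required"
    )
```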

Some of what you have written still confuses me some, but I sense that there is much we hold in agreement.

@jralls
Contributor

jralls commented Feb 13, 2012

E3: Birth: P4 (child), P1 (father), P2 (mother)
E4: Birth: P1 (child), P5 (father), P6 (mother)
E5: Birth: P3 (child), P5 (father), P6 (mother)

That's going way out on a limb. Censuses seldom provide more than one child-parent relationship, and unless the census also documents the length of the marriage one cannot assume that the non-HoH (usually the wife) is the other parent of the children.

@EssyGreen

@ttwetmore

Sounds like you might prefer 2-tier systems over n-tier ones

No, I totally agree with you re the N-tier - I was trying to keep it simple in the discussion :)

How does a genealogist, when putting a person record into the conclusion model, refer to the persona records from the record model that justify the person record?

I agree this needs clarifying - one for Ryan :)

there is no need to copy any properties/facts from the persona records, into their conclusion record

I totally agree - I think I just got confused by your use of the word "inherit"

Some of what you have written still confuses me some, but I sense that there is much we hold in agreement.

I agree - I think we have a bit of a semantics problem - mostly seem to be saying the same things in different ways :)

@jralls
Contributor

jralls commented Feb 13, 2012

  • a Role is the part a single person plays in a single event or relationship
  • a Relationship is a state of "connection" between two or more people which may be bound by time (and possibly other factors I can't think of!)

Because a Relationship can be bound by time (and possibly place e.g. having a holiday romance maybe?) it is simply an Event. For example, a husband/wife relationship = a marriage event (NB - not the same as the wedding or the marrying event!)

OK. That's pretty much the same argument Ryan makes in #84 when he justifies collapsing events and characteristics into facts. From a data modelling standpoint relationships need to be a separate class so we can use them to link our descendency graphs.

I'm not really keen on atomizing evidence into a bunch of separate records, though I see the attraction for FamilySearch with their built-in trees. It's very difficult to get all of the evidence, especially the contextual evidence (great-great-grampa's wife was Mary. Oh, look, in the census before they got married, there's a Mary of the right age two houses over. Hmm, what can we find out about her?)

@EssyGreen

@jralls

That's going way out on a limb

Why? How would you interpret it?

Censuses seldom provide more than one child-parent relationship

Surely not? There are many instances where there are grandparents, sisters, cousins etc etc. In my opinion these are better Interpreted within the scope of the record before being linked in the Conclusion model and that's the benefit of derivatives .... I get a (derived) photocopy of the original; I might Transcribe it (into another derived record), if in say Latin then I might also Translate (derived from the Transcription) then I Interpret (into Facts/Roles etc) from any one of these into another derived record. If the record was ambiguous then I might have more than one Interpretation. Finally I can create a Concluded record and link to the derived interpretation. If anything goes astray I can trace all the way back up to see the level of detail I need and determine where it went wrong.

unless the census also documents the length of the marriage one cannot assume that the non-HoH (usually the wife) is the other parent of the children.

You can Interpret it to mean that and/or create multiple interpretations for the different possibilities and/or you can leave it out. It is up to the individual researcher. Providing you keep the layers which the interpretation was taken from there is no problem. I don't see this as any more problematic than say Ancestry parsing the data into Name/Date/Place fields (and hence incurring errors such as mis-reading handwriting or mis-understanding where the places are).

@EssyGreen

@jralls

From a data modelling standpoint relationships need to be a separate class so we can use them to link our descendency graphs

Can you explain? I don't understand

@ttwetmore
Author

I pretty much agree with Sarah's view that census events imply other events, and that we should infer them. She may be going out on a limb but it is a statistically sound limb. There will be errors introduced, but there will be far more birth events to work with.

(I have some census-handling software that takes census records and generates the residence and birth events exactly as Sarah has summarized them. Note that on these constructed birth events it is usually possible to add the birth places of the parents, even if their names are not known. Plus we can often estimate the date of the marriage of the head of household and spouse [thus eliminating John's concern in some cases], and we often know extra info based on the "number of children/number of living children" fields for married women.)

I think that when we leave the comfortable world of easy person-based genealogy and we are far enough back that we have to enter the more challenging world of records-based genealogy, our work becomes ridden with errors and inconsistencies that we must learn to make sense of. I don't know how (or even if) it is possible to make a convincing statistical argument for this, but I believe that the value of the addition of vast numbers of inferred birth events (with parents assigned) from census records as outlined by Sarah, will greatly outweigh the problems introduced by sometimes having one or the other or even both of the parents wrong.

All records-based genealogy involves learning to cope with errors and inconsistencies in the records. We cannot reject a significant source of valuable records just because we know they introduce a certain percentage of errors.

@EssyGreen

Exactly so :)

On a side-line my great fear at the moment is the propagation of highly inaccurate data through sites which (for commercial reasons) are combining social networking and genealogy. I fervently hope that GEDCOMX can lay down some standards that will encourage users to think (e.g. interpret, analyse, hypothesise and conclude) rather than just bulk copy (which usually breaches data integrity, privacy and copyright in one fell swoop!).

@ttwetmore
Author

John,

Thanks very much for straightening me out on where the event concept resides in the record model. I'll think about that and maybe comment later.

@jralls
Contributor

jralls commented Feb 13, 2012

E3: Birth: P4 (child), P1 (father), P2 (mother)
E4: Birth: P1 (child), P5 (father), P6 (mother)
E5: Birth: P3 (child), P5 (father), P6 (mother)

That's going way out on a limb.

I pretty much agree with Sarah's view that census events imply other events, and that we should infer them. She may be going out on a limb but it is a statistically sound limb. There will be errors introduced, but there will be far more birth events to work with.

Yes, facts recorded in census records can imply other events. Where Sarah goes out on a very weak limb is the assumption that P2 is the mother of P4 and that P1 and P3 share both parents. Neither of those relationships is implied by the census record at hand: There are far too many cases where a man's current wife is not the mother of (all) his children, and to assume otherwise is very bad practice.

Censuses seldom provide more than one child-parent relationship, and unless the census also documents the length of the marriage one cannot assume that the non-HoH (usually the wife) is the other parent of the children.

Surely not? There are many instances where there are grandparents, sisters, cousins etc etc. In my opinion these are better Interpreted within the scope of the record before being linked in the Conclusion model and that's the benefit of derivatives

In all of those cases the census documents a single relationship, that to the HoH. I have no problem with inferring an event that's directly supported by a document: Obviously, if a person is enumerated, that person was very likely born some time before the census (I haven't yet found any cases where a census taker enumerated someone who didn't exist, but I wouldn't completely rule it out, either) -- maybe even at the time indicated by the recorded age (though my great-grandmother Ralls is 43 in both the 1910 and 1920 censuses). If the person in question is listed as the HoH's son, well we can conclude that he probably is, but that tells us nothing about who was his mother. If the son is 10 and the census records that the parents have been married 12 years, we might, in the absence of other evidence, infer that the wife is indeed the son's mother, but we shouldn't have too much confidence in that until we find some corroborating evidence.
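The age-versus-marriage-length check in that last example can be written down as a small rule. The function name and the two labels are invented for this sketch, and "tentative" deliberately means no more than "not contradicted by this one record".

```python
def maternity_inference(child_age, years_married):
    """Classify the inference that the HoH's current wife is the child's mother.

    years_married is None when the census does not record it.
    """
    if years_married is None:
        return "unsupported"  # nothing in the record bears on the question
    if child_age < years_married:
        return "tentative"    # child born within the marriage; still needs corroboration
    return "unsupported"      # child was born before the recorded marriage began


print(maternity_inference(10, 12))
print(maternity_inference(14, 12))
```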

@ttwetmore
Author

John,

I understand what you are saying. You are more concerned with avoiding errors, and I am more concerned with adding data. This argument boils down to where along the spectrum of more data versus a higher error ratio one is comfortable living. As you point out you can never guarantee that the spouse in a household record is the other parent of the children. If the children are young the probability goes up. If the record says how long the spouses have been married (inferable in many records) and the children fit that time period then the probability goes up. But you can never know for sure. As you can never know for sure that the head is one of the parents either. There is nothing absolutely true in genealogical data other than direct eye witness evidence. I know exactly when each of my parents died because I was at their bedsides each time. My Dad's grave marker has the wrong year.

The question here is what is more valuable -- data with a known ratio of errors -- or no data at all -- or what error rates are acceptable to use the data?

There are objective criteria that can be used to determine what ratio of errors is acceptable. That is, we don't have to just argue about it on theoretical, "errors are bad," terms. Errors can be measured, and the effect of those errors can be measured against the accuracy of the overall conclusions made in establishing family groups and pedigrees. I can't give you a formula for doing these tests now, but they are certainly possible, and certainly have to be completed before we can truly find the error ratio we should be willing to accept.

@jralls
Contributor

jralls commented Feb 13, 2012

Tom,

Genealogy isn't about data. We have artifacts, including documents, that provide evidence. We search diligently to find all of the artifacts about a person that we can, we carefully analyze the artifacts to extract the evidence that the artifacts provide directly, and we write a proof argument (or a set of proof arguments) about that evidence in which we discuss the quality of the artifacts that we've found, the credibility of the informants, the proximity of the artifacts to the events that they record, and so on. We draw conclusions from the evidence, explaining why we prefer some evidence to other, conflicting evidence when that arises. We synthesize the results into a biographical sketch of the subject, and either write the sketch in prose or slice it up into little pieces and enter it into our genealogy programs. It would be nice if those programs were written to help with the process or even just to document it instead of just recording the final conclusions, but none of them are.

The only evidence directly contained in the census record in Sarah's example is the relationship of each enumerated person to the HoH. I said earlier that I'm not keen on slicing up evidence, but obviously FS and Ancestry have to do so in order to allow us to search the records. If they assign P2 as P4's mom, though, and she isn't, that's going to hurt searchability for researchers who already know who P4's mom is and include the correct information on their search forms.

It's fine -- necessary, even -- to infer events for which there is no direct evidence. The place for recording those inferences is in the proof argument, not in the record of evidence.

The error you're introducing here is not the kind that works well with statistical analysis of what's an acceptable error rate. That applies to randomly distributed errors like typing errors or misreading handwriting. Well designed processes are aimed at minimizing errors. For example, FS has two people index records independently, and any differences are reviewed by a third independent arbitrator.
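The double-indexing process mentioned above can be sketched roughly as follows. This is only an illustration of the general technique (two independent transcriptions, with a third party deciding when they differ); the class and method names are hypothetical and it does not describe FS's actual software.

```java
import java.util.Optional;

// Hypothetical sketch of double-keyed indexing with arbitration.
// None of these names come from FamilySearch or GEDCOM X.
final class DoubleKeying {

    // Returns the agreed value when both independent transcriptions match
    // (ignoring case and surrounding whitespace); empty when they disagree.
    static Optional<String> reconcile(String first, String second) {
        String a = first.trim();
        String b = second.trim();
        return a.equalsIgnoreCase(b) ? Optional.of(a) : Optional.empty();
    }

    // When the two indexers disagree, the arbitrator's reading is used.
    static String resolve(String first, String second, String arbitratorChoice) {
        return reconcile(first, second).orElse(arbitratorChoice);
    }
}
```

The point of the sketch is that random transcription errors are caught by disagreement between the two readings, which is a different situation from an inference that every indexer would make the same way.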

The other counter argument to your "acceptable error" is garbage in, garbage out. There's enough garbage in the original records. It's stupid to add more.

@jralls
Contributor

jralls commented Feb 13, 2012

From a data modelling standpoint relationships need to be a separate class so we can use them to link our descendency graphs

Can you explain? I don't understand

If I said "family tree" instead of "descendancy graph", would it be clearer?

It's easier to illustrate the construction of the tree/graph in the model if the objects used to form the arcs on the graph are in their own class rather than a subset of some other class. That could be accomplished by making Relationship a subclass of Fact or an unrelated class, whatever makes the model easier to understand.

That doesn't mean that it needs to be a separate class in the implementation. The goal is different there, to balance performance, storage density, and code maintainability (not necessarily in that order! ;-) ).
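As a rough illustration of relationships being their own class and serving as the arcs of the tree/graph, here is a minimal sketch. All names (`PersonNode`, `RelationshipArc`, `FamilyGraph`) are hypothetical and are not taken from the GEDCOM X code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model sketch: persons are nodes, relationships are arcs.
final class PersonNode {
    final String id;

    PersonNode(String id) {
        this.id = id;
    }
}

final class RelationshipArc {
    final PersonNode from;
    final PersonNode to;
    final String type; // e.g. "ParentChild"; an assumed label, not a spec value

    RelationshipArc(PersonNode from, PersonNode to, String type) {
        this.from = from;
        this.to = to;
        this.type = type;
    }
}

final class FamilyGraph {
    private final List<RelationshipArc> arcs = new ArrayList<>();

    void link(PersonNode from, PersonNode to, String type) {
        arcs.add(new RelationshipArc(from, to, type));
    }

    // Walks the arcs to find everyone the given person points to as a parent.
    List<PersonNode> childrenOf(PersonNode parent) {
        List<PersonNode> children = new ArrayList<>();
        for (RelationshipArc arc : arcs) {
            if (arc.from == parent && arc.type.equals("ParentChild")) {
                children.add(arc.to);
            }
        }
        return children;
    }
}
```

An implementation could just as well store the same arcs inside some other structure; the sketch only shows why a separate class makes the graph construction easy to read in a model.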

@ttwetmore
Author

John,

Thanks for your detailed response. You make excellent and conventional arguments, but we see things from different perspectives. For me genealogy is all about data, and I am very interested in bringing computing techniques to bear upon the problem of discovering links between persons mentioned in different record sets, in a rational manner based on firm mathematical and statistical principles, finding algorithmic techniques that recognize errors in a statistically significant way. Most genealogists have no faith that such computing techniques would ever be able to provide such a valuable service, or are not even conscious that sophisticated computing techniques might exist to help them. For the processes that I imagine, one must accept the presence of errors, because they are there, so one devises statistically sound methods to minimize their impact. When generating birth events from census records I would want to generate records both with only the head of household as a single parent and with head and spouse as two parents, and run detailed experiments to determine which set of data does a better job of family recognition and pedigree generation, that is, whether the advantages of having two parents outweigh the disadvantages of the additional errors. You simply cannot state by fiat which will be better; one must find out. For me it is an engineering decision, not a theoretical one. I want to figure out the best way that genealogical computing can help recognize the same human beings as they are manifested in different record sets.

@jralls
Contributor

jralls commented Feb 13, 2012

Tom,

OK, that's a completely different, and very interesting, approach. Considering that very little genealogical evidence is available in digital form and that the contextual data needed to properly evaluate the evidence is seldom encoded, I suspect you're pushing the envelope just a bit. Good luck with it though.

That said, I don't think that a long-term research project a galaxy or two away from mainstream genealogy should drive a standard interchange format.

@ttwetmore
Author

John, Thanks. I also agree that long-term research projects should not drive the transport/archive formats. Being able to support the n-tier person model is all that my ideas need, however, and there is already some pressure to add this support from other areas. Adding that capability is as simple as allowing person references to occur in person records. No real impact on the data model (other than adding one line!).

@jralls
Contributor

jralls commented Feb 13, 2012

So just

index dae11cc..0069d67 100644
--- a/gedcomx-conclusion/src/main/java/org/gedcomx/conclusion/Person.java
+++ b/gedcomx-conclusion/src/main/java/org/gedcomx/conclusion/Person.java
@@ -45,6 +45,7 @@ public class Person extends GenealogicalEntity implements Pers
   private List genders;
   private List names;
   private List facts;
+  private List componentPersons;

   /**
    * Living status of the person as treated by the system. The value of this pr

(along with a getter and setter, of course, since there aren't constructors, and some (de)serialization code)?

Or do you mean the equivalent in Persona.java? I don't see a PersonReference type anywhere, so I'm guessing that you mean (in C++) &Person.

I think as I understand this thing right now, it would make more sense in Record than Conclusion, but I still don't grok how the two fit together, so I might well change my mind about that. Other than that I have no issue with it.
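To make the "person references inside person records" idea concrete, here is a small sketch of an n-tier person. It is an assumption-laden illustration: `TieredPerson` and its methods are hypothetical names, and only the `componentPersons` field name comes from the diff above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical n-tier person: a person may be built from other persons,
// bottoming out in evidence-level personas that have no components.
final class TieredPerson {
    final String label;
    final List<TieredPerson> componentPersons = new ArrayList<>();

    TieredPerson(String label) {
        this.label = label;
    }

    TieredPerson add(TieredPerson component) {
        componentPersons.add(component);
        return this;
    }

    // Collects the leaf persons (those with no components) in document order.
    List<TieredPerson> leaves() {
        List<TieredPerson> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(TieredPerson person, List<TieredPerson> out) {
        if (person.componentPersons.isEmpty()) {
            out.add(person);
            return;
        }
        for (TieredPerson component : person.componentPersons) {
            collect(component, out);
        }
    }
}
```

Whether such a field belongs in the record model or the conclusion model is exactly the open question in this comment; the sketch takes no position on that.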

@stoicflame
Member

Sorry for the delay on this.

See changes at 8927d5c, 8927d5c, and 8927d5c.

Now what? Anything else before I merge?

@stoicflame
Member

What on earth is "Move"?

Maybe we should rename it to Relocation?

@jralls
Contributor

jralls commented Jul 10, 2012

Now what? Anything else before I merge?

Paragraph 5.1 still requires that "The sources of a conclusion MUST also be sources of the conclusion's containing entity (i.e. Person or Relationship)." That needs to be reworked to handle Events.

The catch-all value for EventRole is still "Witness". Concordant with many extant programs, but doesn't always express the correct meaning.

@jralls
Contributor

jralls commented Jul 10, 2012

What on earth is "Move"?

American for "Removal".

Maybe we should rename it to Relocation?

Can't we just include it in the overloads of "Departure" and "Arrival"?

@stoicflame
Member

Can't we just include it in the overloads of "Departure" and "Arrival"?

Hmm... maybe. Tracking at #186

@stoicflame
Member

That needs to be reworked to handle Events.

How? Events don't contain any conclusions, at least not on the branch we're collaborating on here. I'm aware of the type refactor being coordinated across other threads, but that's separate from this discussion.

The catch-all value for EventRole is still "Witness".

Actually, the catch-all value is null. Just leave it empty if none of the known roles fit. Or put in your own custom type.

That concept is not unique to EventRole; it applies wherever we're maintaining a controlled vocabulary. If that needs to be clarified, let's open up a separate issue to address all those places.
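For readers following along, a minimal sketch of a shared event with role-typed participants, where the role type may be left empty or set to a custom value. The names `SharedEvent` and `RoleRef` and the string-typed role are assumptions made for illustration, not the classes on this branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual GEDCOM X Event/EventRole classes.
final class RoleRef {
    final String personId;
    final String roleType; // null when no known role fits; may also be a custom identifier

    RoleRef(String personId, String roleType) {
        this.personId = personId;
        this.roleType = roleType;
    }
}

final class SharedEvent {
    final String eventType;
    private final List<RoleRef> roles = new ArrayList<>();

    SharedEvent(String eventType) {
        this.eventType = eventType;
    }

    SharedEvent addRole(String personId, String roleType) {
        roles.add(new RoleRef(personId, roleType));
        return this;
    }

    // One role per participant: grows linearly with the number of persons.
    int roleCount() {
        return roles.size();
    }

    // Directed person-to-person relationships that would otherwise be needed: n * (n - 1).
    int pairwiseRelationshipCount() {
        int n = roles.size();
        return n * (n - 1);
    }
}
```

With five participants this stores 5 roles where the pairwise approach would potentially need 20 relationships, which is the N-squared concern that opened this thread.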

@jralls
Contributor

jralls commented Jul 11, 2012

Events don't contain any conclusions

Ah, sorry, I thought for some reason that EventRole extended Conclusion rather than GenealogicalResource.

let's open up a separate issue to address all those places.

#187

@EssyGreen

Why did the ordering get taken out? I realise the order was questioned by @jralls and I guess I missed the opportunity to defend it so am doing so now by answering: the order which the researcher thought was important!

  • accounts, emails, phones etc ... order would probably be most used
  • addresses - e.g. local branch might come before head office
  • sources - might be the order of discovery
  • alternate forms - might be ordered by probability/frequency of use
  • facts - might be order in which they occurred where dates are approximate

etc etc

The point being that the order was determined by the originating researcher and/or application. The message to any importing application should be: Don't mess with it unless you have to!

@EssyGreen

What on earth is "Move"?

Maybe we should rename it to Relocation?

We had this discussion before ... Move/Relocation has to be either away from or to somewhere so it needs 2 fact types/events like immigration/emigration in case peeps die enroute etc

@jralls
Contributor

jralls commented Jul 11, 2012

The point being that the order was determined by the originating researcher and/or application.

My original question was "ordered by what?". That's a programming-domain question, not a conceptual-domain one. If the answer is "ordered list in the sense of xsd:sequence" (i.e., implementations should preserve document order), OK, that works for me.

@EssyGreen

The point being that the order was determined by the originating researcher and/or application.
My original question was "ordered by what?". That's a programming-domain question, not a conceptual-domain one.

Yes I get that .. so I should have said "change ordered list to a list which maintains its original order" :) The exact term depends on the language you're using.

@stoicflame
Member

If the answer is "ordered list in the sense of xsd:sequence" (i.e., implementations should preserve document order), OK, that works for me.

That was the intent, yes.

Doesn't the concept of a "list" imply that it "maintains its original order"? It does in Java, anyway. So that's why I removed the "ordered" qualifier... it was redundant and (based on the fact that it was brought up) it seems to cause confusion.
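A small Java sketch of the distinction being discussed: a `java.util.List` such as `ArrayList` keeps elements in the order they were added, while a sorted collection such as `TreeSet` imposes its own order. The helper class `OrderDemo` and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper contrasting insertion order with an imposed sort order.
final class OrderDemo {

    // ArrayList preserves the order in which the values were supplied.
    static List<String> asEntered(String... values) {
        return new ArrayList<>(Arrays.asList(values));
    }

    // TreeSet re-orders by natural (alphabetical) ordering and drops duplicates.
    static List<String> viaSortedSet(String... values) {
        return new ArrayList<>(new TreeSet<>(Arrays.asList(values)));
    }
}
```

For example, `OrderDemo.asEntered("local branch", "head office")` keeps the researcher's order, whereas `OrderDemo.viaSortedSet("local branch", "head office")` returns "head office" first.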

@ttwetmore
Author

I agree that lists are implicitly ordered. So there is one item of ambiguity concerning lists that should be specified up front.

A genealogical object may have many properties of the same type (e.g., a name, a birth event, etc.). When multiple properties of the same type are found in an object, it must be clear which is the one that is the "preferred" value of the property, that is, the one to be shown in displays or to be treated as the most important, or the one to be used in age or other genealogical algorithms.

Some people choose the first and some choose the last. It should be made clear from the beginning by having it defined into the specs.

In the record model this issue would generally not arise, as most records only include one value for each property. However, in the conclusion model one might want to list all the names found on all the items of evidence that the researcher has decided refer to the same person.

[Aside: In the LifeLines program I wrote eons ago I chose the "first is best" interpretation. After putting the program into the public domain someone willy nilly changed that interpretation in a few spots to be "last is best." In the current release you have to experiment on a case by case basis to discover which is which. Names are first is best. Deaths are last is best. Not a happy situation.]
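The two conventions can be sketched side by side. This is a hypothetical helper, not code from LifeLines or GEDCOM X, and it assumes the list elements are non-null.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper showing the two competing "preferred value" conventions.
final class PreferredValue {

    // "First is best": the first entry in the list is the preferred one.
    static <T> Optional<T> firstIsBest(List<T> values) {
        return values.isEmpty() ? Optional.empty() : Optional.of(values.get(0));
    }

    // "Last is best": the last entry in the list is the preferred one.
    static <T> Optional<T> lastIsBest(List<T> values) {
        return values.isEmpty() ? Optional.empty() : Optional.of(values.get(values.size() - 1));
    }
}
```

Given the same list of names, the two methods return different answers whenever there is more than one entry, which is why the interpretation needs to be fixed in the spec rather than left to each application.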

@stoicflame
Member

it must be clear which is the one that is the "preferred" value of the property

See discussion at #176

@EssyGreen

Doesn't the concept of a "list" imply that it "maintains its original order"?

No there are many different types of list :)

@stoicflame
Member

Okay. See d234f8b for the wording to clarify that the order of lists is preserved.

I'm going to give a day or so for further comments. Assuming no big objections, I'm going to merge.

Conflicts:
	specifications/conceptual-model-specification.md
	specifications/json-format-specification.md
	specifications/support/gedcomx.zargo
stoicflame added a commit that referenced this pull request Jul 16, 2012
Relationships are N-Squared: Provide for Shared Events
@stoicflame stoicflame merged commit 05a6e22 into master Jul 16, 2012
@stoicflame
Member

Merged at 05a6e22

@jralls
Contributor

jralls commented Jul 20, 2012

Ryan, can you convert this back into an issue and re-open it? There are a ton of things discussed here (and which you consolidated from other issues) which aren't covered by shared-events.

@EssyGreen

Maybe it would be easier to have new issues to discuss the other stuff .... it's a bit of a mammoth thread already

@stoicflame
Member

Maybe it would be easier to have new issues to discuss the other stuff .... it's a bit of a mammoth thread already

Yes, please. If there are issues that this thread spawned, let's open them separately rather than making people trudge through this one to get context.
