Need mechanism to distinguish "extracted conclusions" from "working conclusions" #202

Closed

stoicflame
Member

In issue #144, we discussed being able to model "extracted conclusions" (Conclusion instances representing persons, relationships, etc. extracted from a record) and "working conclusions" (the Conclusion instances that represent the current state of the researcher's work). But as both types of conclusions are modeled with Conclusion instances, we need a mechanism to distinguish Conclusions in one role or the other (extracted vs. working).

@jralls has suggested adding an "extracted" flag to the Conclusion class as a potential solution to this issue. @stoicflame has voiced (to me) that he feels that this concept belongs in metadata about the conclusion -- that it does not belong as part of the conclusion itself.

How is this concept best expressed in the model? What other options might we consider?

@jralls
Contributor

jralls commented Aug 10, 2012

Ryan, would you elaborate on what you mean by metadata? ISTM all of the Conclusion class's parameters are metadata.

@EssyGreen

I don't believe there is a need for this ... if a Conclusion references a single Source then surely it can only have been "extracted" from that source in some way? If it references multiple sources then it must be some form of "worked" conclusion. If it references none, then it is simply the imagination/knowledge of the researcher.
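As a rough illustration of that rule of thumb, here is a minimal Python sketch that labels a conclusion purely by how many distinct sources it references. The `Conclusion` class, its `source_ids` field, and the three labels are hypothetical names made up for this example; none of them come from the GEDCOM X model.

```python
from dataclasses import dataclass, field


@dataclass
class Conclusion:
    """Hypothetical stand-in for a conclusion; not the GEDCOM X class."""
    id: str
    source_ids: list = field(default_factory=list)


def classify(conclusion: Conclusion) -> str:
    """Label a conclusion by the number of distinct sources it references."""
    distinct = len(set(conclusion.source_ids))
    if distinct == 0:
        return "unsourced"   # researcher's own knowledge or imagination
    if distinct == 1:
        return "extracted"   # taken from exactly one source
    return "working"         # built from several sources


print(classify(Conclusion("C1")))                # unsourced
print(classify(Conclusion("C2", ["S1"])))        # extracted
print(classify(Conclusion("C3", ["S1", "S2"])))  # working
```

The sketch only looks at reference counts, so it says nothing about how a value was derived from its single source.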

Why do we need to make the method by which the conclusion was created explicit? And if we do need to then are "extracted" and "working" the right/only terms?

@jralls
Contributor

jralls commented Aug 12, 2012

No, even a single source will yield inferred evidence. An 1841 census says that Thomas Hartley is 25, so an age "Fact" is extracted evidence. One can infer that he was therefore born between 7 June 1815 and 6 June 1816, but that would not be extracted.
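The inference in that example can be written out as a small calculation. This is only a sketch: the record date of 6 June 1841 is an assumption implied by the quoted range, and `birth_range` is a hypothetical helper name.

```python
from datetime import date, timedelta


def birth_range(record_date: date, age_years: int):
    """Return (earliest, latest) birth dates consistent with a stated age.

    Assumes the age is in completed years on record_date. A record dated
    29 February would need extra handling, because date.replace() raises
    ValueError when the target year is not a leap year.
    """
    # Latest: the person reached the stated age on the record date itself.
    latest = record_date.replace(year=record_date.year - age_years)
    # Earliest: one day after the birthday that would make them a year older.
    earliest = record_date.replace(
        year=record_date.year - age_years - 1
    ) + timedelta(days=1)
    return earliest, latest


earliest, latest = birth_range(date(1841, 6, 6), 25)
print(earliest.isoformat(), latest.isoformat())  # 1815-06-07 1816-06-06
```

The age itself is what the source states; the date range is something the researcher computes from it.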

The reason behind making the distinction goes back to the Gentech GDM and arrives here via Tom Wetmore's N-Tier model, where conclusions are built up by aggregating "lower-level" conclusions. I don't think many genealogists actually work that way -- they seem to prefer to take a more single-level approach, collecting all of the evidence they can and then writing a single essay (proof argument) -- but database architects have trouble modeling that, so they resort to this layered approach.

@EssyGreen

even a single source will yield inferred evidence

Ah I see - you mean you want to differentiate between explicit and implicit/inferred information? That's different in my mind to differentiating between "extracted" and "working" conclusions. .... OK ... I'd go along with some sort of "Method" field. So the valid values would be something like, maybe: Transcribed, Translated, Implied, Inferred?

@jralls
Contributor

jralls commented Aug 13, 2012

you mean you want to differentiate between explicit and implicit/inferred information

Not me. I think atomizing sources into conclusion snippets is a waste of time. I want to link a bunch of sources to an AnalysisDocument which lays out the case for a group of conclusions (perhaps the place, date, facts, and participants of an Event) and then link that as a source to the (minimally) atomized conclusions. "Laying out the case" includes explaining one's inferences.

The N-tier model is what demands a distinction between the types of evidence.

@EssyGreen

The N-tier model is what demands a distinction between the types of evidence.

Why? How-so?

@jralls
Contributor

jralls commented Aug 15, 2012

As I understand Tom's model, the idea is that you start off with a Source, extract the explicit evidence into what Thad has designated as ExtractedConclusion attached to a Person, then aggregate those Persons into composites as you demonstrate that the component Persons are representations of the same historical person -- the latter demonstrations being designated WorkingConclusions. (The "working" part comes from your assertion that conclusions aren't conclusive.)

@EssyGreen

My understanding of the N-tier approach is simply that conclusions can include and reference other conclusions ad infinitum. I don't see why that necessitates differentiating conclusions by the method used to create them - it seems to me to make it harder to implement N-tier since you then have to worry about the validity of ExtractedConclusions interlinking/interacting with WorkingConclusions.

The "working" part comes from your assertion that conclusions aren't conclusive

I said: "I strongly believe that there is no "Conclusion" in genealogy". I was arguing for the allowance of multiple and contradicting hypotheses in the research process not for making some "conclusions" conclusive and others not!

@thomast73 - since you created the post can you clarify what the issue is here?

@nilsbrummond

If extracted conclusions are extractions of evidence found in a single source then perhaps the extracted conclusion should have optional fields related to evidence analysis.

  • primary / secondary
  • direct / indirect
  • readability
  • etc. (whatever is appropriate from EE by Mills, chapter 1)
  • There should NOT be a negative evidence field as it can only be negative when compared to other evidence. Negative evidence would be part of the Analysis Document?

A working conclusion, by contrast, would not have such fields, as it is based on looking at all available evidence together. It just needs to be linked to the Analysis Document or the extracted evidence or both.

Not me. I think atomizing sources into conclusion snippets is a waste of time.

I think there could be some useful tools created to help in evidence analysis with atomizing sources.

I would like software that helps the research and analysis process, not just documents the results, personally. If it exists please let me know. Right now I use FTM2012 but have been looking for something better. Ended up here hoping someone will make it someday...

@thomast73
Contributor Author

...you start off with a Source, extract the explicit evidence into what Thad has designated as ExtractedConclusion attached to a Person, then aggregate those Persons into composites as you demonstrate that the component Persons are representations of the same historical person -- the latter demonstrations being designated WorkingConclusions.

I would have said this thus:

...you start off with a Source, extract the explicit evidence into a Person -- designated an "extracted conclusion", then aggregate those Persons into composites as you demonstrate that the component ("extracted") Persons are representations of the same historical person -- the aggregate Persons being designated "working conclusions".

@thomast73 - since you created the post can you clarify what the issue is here?

The issue is that the same objects are being used to model "extracted" and "working" conclusions. In the above example, a Person object is used to model both "extracted" and "working" conclusions. There is no way to programmatically examine these instances of Person and know which ones are "extracted" and which one represents the current hypothesis.

@EssyGreen

I think there could be some useful tools created to help in evidence analysis with atomizing sources.

+1

In particular I think comparison of the conclusions of a source with other conclusions is critical and this is easier if they are the same type of object (regardless of whether or not they are "extracted" or "working" or whatever).

I would like software that helps the research and analysis process, not just documents the results, personally. If it exists please let me know. Right now I use FTM2012 but have been looking for something better. Ended up here hoping someone will make it someday...

I totally agree (and am in exactly the same situation albeit using FH and FTM and dipping into a few others along the way) but I strongly believe that any meaningful analysis is dependent upon having cohesive data .. this is why I do not agree with the chuck-all-the-conflicts-in-one-person-and-sort-it-out-later approach and why I would like to see the ability to have Hypotheses (cohesive sets of conclusions). That way it gives the researcher the freedom to follow multiple trails and their respective probabilities until/if a "conclusion" (in the general sense of the word) is reached.

..you start off with a Source, extract the explicit evidence into what Thad has designated as ExtractedConclusion attached to a Person, then aggregate those Persons into composites as you demonstrate that the component Persons are representations of the same historical person -- the latter demonstrations being designated WorkingConclusions.

I would have said thus:

... you start off with a Source and interpret it into a number of Hypotheses (frequently only one but sometimes sources are vague and allow for multiple interpretations). Each Hypothesis has a number of "extracted" Conclusions. You then go back to your original Hypotheses (from previous research) and for each you compare the relevant "working" Conclusions with each of the "extracted" Conclusions (in each of the Hypotheses) to see how well they fit. You link them together with +ve/-ve evidence. You then re-evaluate your original hypotheses, set some new goals and go off in search of further evidence to prove/disprove them.

The issue is that the same objects are being used to model "extracted" and "working" conclusions. In the above example, a Person object is used to model both "extracted" and "working" conclusions. There is no way to programmatically examine these instances of Person and know which ones are "extracted" and which one represents the current hypothesis.

An "extracted" Person will be contained within a single source (that which it was extracted from). The parent source of a "working" Person is the GEDCOM-X file itself (or in my case the Hypothesis).

@jralls
Contributor

jralls commented Aug 16, 2012

I think there could be some useful tools created to help in evidence analysis with atomizing sources.

OK. But is that justification for putting it in a results-oriented interchange format like GedcomX?

I would like software that helps the research and analysis process, not just documents the results, personally. If it exists please let me know. Right now I use FTM2012 but have been looking for something better. Ended up here hoping someone will make it someday...

So would most of us. It's not what "here" is about, though.

@jralls
Contributor

jralls commented Aug 16, 2012

... you start off with a Source and interpret it into a number of Hypotheses (frequently only one but sometimes sources are vague and allow for multiple interpretations). Each Hypothesis has a number of "extracted" Conclusions. You then go back to your original Hypotheses (from previous research) and for each you compare the relevant "working" Conclusions with each of the "extracted" Conclusions (in each of the Hypotheses) to see how well they fit. You link them together with +ve/-ve evidence. You then re-evaluate your original hypotheses, set some new goals and go off in search of further evidence to prove/disprove them.

And how do you structure an "Hypothesis" object? My preference is for hypotheses to be documented in a textual argument, but it appears that you have something else in mind.

+ve/-ve?

@EssyGreen

how do you structure an "Hypothesis" object

To me it's just a source object with "working" "conclusions" in it (=Persons)

+ve/-ve?

Positive or negative evidence

@nilsbrummond

I think there could be some useful tools created to help in evidence analysis with atomizing sources.
OK. But is that justification for putting it in a results-oriented interchange format like GedcomX?

Is GedcomX results-oriented? By results-oriented do you mean used to store completed research, and not necessarily in-progress research? The use case of allowing migration from one application to another without data loss would not be supported then?

I would like software that helps the research and analysis process, not just documents the results, personally. If it exists please let me know. Right now I use FTM2012 but have been looking for something better. Ended up here hoping someone will make it someday...
So would most of us. It's not what "here" is about, though.

The logical model of gedcomX has the potential to affect a lot of what will become available as software features. I agree it is not what "here" is about, but it needs some consideration when doing what "here" is about. Look at how many apps have directly used gedcom's logical model as their own.

@EssyGreen

I agree with @nilsbrummond :) however ...

The use case of allowing migration from one application to another without data loss would not be supported then?

Completely loss-less migration is unlikely since every app will necessarily have its own features/data in order to compete in the market. But what GEDCOM-X can/needs to do is extend the "core" that generally is/will be supported.

@jralls
Contributor

jralls commented Aug 16, 2012

Is GedcomX results-oriented? By results-oriented do you mean used to store completed research, and not necessarily in-progress research? The use case of allowing migration from one application to another without data loss would not be supported then?

No. By results-oriented I mean that GedcomX reflects the search-source-argument-conclusion model of the Genealogical Proof Standard. It is not AFAICT a use-case to capture everything that any future genealogical program might want to store.

The logical model of gedcomX has the potential to affect a lot of what will become available as software features. I agree it is not what "here" is about, but needs some consideration when doing what here is about. Look at how many apps have directly used gedcom's logical model as their own.

Roger, and that's why Sarah, Tom, and I have been working so hard to move the GedcomX model towards supporting the full range of the GPS's proof requirements. There is no extant software that does, but we hope that by having a data model with LDS backing (they are, after all, the largest cohesive genealogy market) that does, the software will follow.

I also recognize that most extant genealogy software does atomize evidence, and that GedcomX needs to support that atomization in a way that permits that extant software to use it or GedcomX will be stillborn. You've taken my earlier statement about my dislike of atomizing evidence, which was in the context of explaining why this issue is here, and turned it into a declaration that GedcomX shouldn't support it.

The resident expert on analysis-support software is Tom Wetmore. Go read through #134 for some really interesting explanations of what he wants to do. See if you think his vision lines up well with the GPS, if it's "in-scope" for GedcomX, and if anyone else is likely to write similar programs for DeadEnds to exchange data with.

@EssyGreen

most extant genealogy software does atomize evidence

Are we meaning different things here ... I don't know any software which does this ... there is no equivalent in GEDCOM 5 or any software I have seen for dissecting a Source into Persons, Events etc whilst maintaining the context of the original source.

@EssyGreen

GedcomX reflects the search-source-argument-conclusion model of the Genealogical Proof Standard

GEDCOM-X is still struggling to support the GPS. It hardly "reflects" it - see #191

@stoicflame
Member

Okay, folks, sorry for the response delay. Been busy.

The flag on the conclusion makes me nervous because it mixes the concept of how the data is to be used with the data itself. It smells to me.

I guess I'd like something that conceptually contains the extracted conclusion(s) so that the context can be applied at the level of the container. I don't know, something like a new object, maybe ConclusionSet that contains pointers to the persons, relationships, and events in the set. You might have it contain sources or maybe some of the other things @nilsbrummond has suggested.

And don't bother bringing up the record model. I get it. You told me so. :-) The difference is that my suggestion here is a reference model and not an encapsulation model.

@EssyGreen

I'd like something that conceptually contains the extracted conclusion(s) so that the context can be applied at the level of the container

Surely this is the GEDCOM-X file itself (which is itself a source)? (Assuming we agree that the GEDCOM-X file may be used as a source and therefore should be defined as one)

@stoicflame
Member

Surely this is the GEDCOM-X file itself (which is itself a source)? (Assuming we agree that the GEDCOM-X file may be used as a source and therefore should be defined as one)

The file will contain both "extracted" conclusions and "working" conclusions (as we've been using the terms). I'm talking about something more fine-grained than the file. A container for all conclusions (persons, relationships, events) that are extracted from a single source.

@EssyGreen

The file will contain both "extracted" conclusions and "working" conclusions (as we've been using the terms). I'm talking about something more fine-grained than the file. A container for all conclusions (persons, relationships, events) that are extracted from a single source.

The way I see it a set of conclusions always resides within a source ... this source might be a simple document; or a book; or a project; or a tree; or a branch; or a set of conclusions about people with the same surname (1-name research); they might have several projects/trees/whatever within one GEDCOM-X file or they might just lump them all together in one GEDCOM-X file with no structure at all. It doesn't really matter since the GEDCOM-X file is itself a source so forms the outermost source.

Some of these "sets of conclusions" may contradict one another (e.g. I would create different sources to cater for different hypotheses ie to follow parallel trails to resolve a scenario with multiple possibilities) and some might complement each other. The way the sources are used by the researcher determines this ... e.g. a hypothesis that resulted in an impossibility might be used as negative evidence; whilst another might be used as proof.

This way we have a very flexible way of building and linking together sources and conclusions. If we try to provide a different type of object then we lose that flexibility and create instead a rigid hierarchy.

@jralls
Contributor

jralls commented Aug 18, 2012

most extant genealogy software does atomize evidence

Are we meaning different things here ... I don't know any software which does this ... there is no equivalent in GEDCOM 5 or any software I have seen for dissecting a Source into Persons, Events etc whilst maintaining the context of the original source.

Maybe. But "dissecting a Source into Persons" isn't what I was trying to describe. Most genealogical software and Gedcom5 have little atoms of conclusion: a "fact" or an "event" which have attributes of date, place, one or more linked persons ("individuals"), and a list of "sources" which are really citations. It's a rare source that contains only one "fact", so one creates a bunch of "fact" records, each of which has a pointer to the same "source" record. If you find sources which disagree about something, you have to create multiple "facts" with different values and mark one of them "preferred" or something. That's what I mean by "atomizing evidence".
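To make that pattern concrete, here is a minimal Python sketch of atomized facts: several small records, each pointing at one source, with a `preferred` flag to settle disagreements. The `Fact` class, its field names, and the sample values are all hypothetical illustrations, not GEDCOM 5 or GedcomX structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    """Hypothetical 'atom' of conclusion: one value, one person, one source."""
    fact_type: str
    value: str
    person_id: str
    source_id: str
    preferred: bool = False


# Two sources disagree about the birth year, so two facts are stored
# and one of them is marked as preferred.
facts = [
    Fact("Birth", "1815", "P1", "S1", preferred=True),
    Fact("Birth", "1816", "P1", "S2"),
    Fact("Residence", "Bristol", "P1", "S1"),
]


def preferred_fact(facts, person_id: str, fact_type: str) -> Optional[Fact]:
    """Return the preferred fact of a type, else the first one found."""
    candidates = [
        f for f in facts
        if f.person_id == person_id and f.fact_type == fact_type
    ]
    for fact in candidates:
        if fact.preferred:
            return fact
    return candidates[0] if candidates else None


chosen = preferred_fact(facts, "P1", "Birth")
print(chosen.value, chosen.source_id)  # 1815 S1
```

Nothing in this structure records why one value was preferred over the other; that reasoning would have to live somewhere else.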

@jralls
Contributor

jralls commented Aug 18, 2012

Assuming we agree that the GEDCOM-X file may be used as a source and therefore should be defined as one

I agree that a GedcomX file may be used as a source and cited appropriately. In that case it's no different from a compiled genealogy in print. I don't think that that has anything to do with how the file is defined, nor has any bearing on the question at hand.

set of conclusions always resides within a source

I'd say "depends upon one or more sources". A conclusion based on only one source is weak. "Resides within" implies that the conclusion is contained in the source, which might be true (if the source is, say, a compiled genealogy) but then the conclusion is extracted. I didn't think it up myself, I just copied it from the source. Would the interesting sort of conclusion, one that I think up after thorough research and analysis, "reside within" any one of the sources?

@EssyGreen

set of conclusions always resides within a source
I'd say "depends upon one or more sources".

You misunderstand me ... by "resides in" I mean the containing source defines the author(s)/editor(s)/transcriber(s)/interpreter(s) etc of the conclusions... I totally agree that any decent conclusion will also reference many sources in order to provide proof/evidence.

@EssyGreen

Here's the sort of thing I mean in pseudo data:

S1 My Family Tree by Sarah Green, last edited 1st August 2012

  • P1 J Bloggs [Evidence: S2.P1]
      - F1 Birth etc etc
  • P2 P Bloggs etc etc [Evidence: S2.P2]

S2 Birth Certificate for J Bloggs, GRO ref 12345, copy dated 1st Jan 2011 held/interpreted by Sarah Green

  • P1 J Bloggs
      - Birth 18th March 1841 Bedminster, Bristol, England
  • P2 P Bloggs etc etc
  • P3 G Smith etc etc

S3 Bloggs Family Tree, created by F Jones, last edited 15th August 2010

  • P1 J Bloggs [Evidence: S1.P1]
      - Birth 18th March 1841 Somerset, England
  • P2 A Baggins

I've deliberately shown S3 as being out of date here ... just as it could have been if they had been referencing say an external web site. Maybe that was a complication too far and will prolly just highlight other problems but hey ho.

@jralls
Contributor

jralls commented Aug 19, 2012

Here's the sort of thing I mean in pseudo data:
...

OK. That reinforces my point that using "My Family Tree" as a source has as much to do with its structure as it does with how the GRO formats their birth certificates: None at all. It's utterly irrelevant to both this discussion and #192.

@EssyGreen

That reinforces my point that using "My Family Tree" as a source has as much to do with its structure as it does with how the GRO formats their birth certificates: None at all. It's utterly irrelevant

So if I published "My Family Tree" you would refuse to recognise it as a source would you?

And what's your beef with the GRO?? Their format is pretty sensible when you look at the process involved. How would you go about organising millions of BMD certificates then?

Can you at least express why you think my points are irrelevant?

@EssyGreen

What if instead of introducing a new object, we add a new field to the SourceDescription (suggested name extractedConclusions) that is a list of references to the conclusions that are extracted from the source being described?

The problem with this is that it implies that a Conclusion can be extracted from multiple sources. This muddies what we mean by "extracted from" ... If Source S1 and Source S2 have Person P1 in their list then P1 is not an extraction but a compound amalgamation (more like a "working" conclusion)

Could we not instead have a Conclusion attribute which is a pointer to the source it was "extracted from"?

@nilsbrummond

Here is what I am thinking:

<gedcomx-header>
   <sourceDescription id="S0" title="My Family Tree">
      <!-- source definition for this gedcom-x itself for external referrers to use. -->
   </sourceDescription>
</gedcomx-header>

<sourceDescription id="S1" title="1900 United States Federal Census Record">
   <!-- description of source 1 as defined by work done for #144 -->
   <sourceDescription id="S2" title="Record for Joshua Amis">
      <!-- description of source 2 as defined by work done for #144 -->
      <sourceDescription id="S3" title="Interpretation by Sarah Green">
         <!-- description of source 3 as defined by work done for #144 -->
         <person id="P2" title="Joshua Amis">
            <!-- data for person interpretation of person 2 contained in image of source 2 -->
            <fact id="F1" type=".../Birth" Value="1888" />
         </person>
      </sourceDescription>
   </sourceDescription>
</sourceDescription>

<sourceDescription id="S4" title="Some other source">
    <!-- description of source 4 as defined by work done for #144 -->
     ...
</sourceDescription>

<!-- Working Conclusions follow as top level objects -->

<person id="P3" title="Josh Amis">
    <!-- data for working person -->
    <source resource="P2" />
    <fact id="F2" type=".../Birth" Value="1888">
       <!-- Reference to atomized extracted conclusion -->
       <source resource="F1" />
       <!-- Reference to non-atomized conclusions:  ref the source or the analysis -->
       <source resource="S4" />
    </fact>
</person>

@nilsbrummond

(PS: How do you paste/add your XML blocks in this interface?)

3 back-ticks, language name (XML in this case) to open

3 back-ticks to close

@EssyGreen

Many thanks @nilsbrummond :)

I would prefer all Persons to be in a sourceDescription because this allows the researcher to then refer to them in the same way as other Persons .. For example, say I'm researching John Milsom (WP1) in one family tree (FT1) and come across two possible Census entries for him. Taking aside the interpretation of the actual Censuses for a moment, I want to be able to create 2 new family trees (FT2 and FT3) to investigate each of them (with Persons WP2 and WP3 respectively). At some point I may decide that WP2=WP1 or WP3=WP1 or neither of them or both of them. It makes life much easier if I can just cite my own research/investigation just like I would any other source.

@stoicflame
Member

Okay, let me repeat back to you what I think you're trying to do using the XML as specified right now. (Note it's a lot flatter than what you've got in mind, but I think it means the same thing.)

<sourceDescription id="S0" about="(reference to the file itself)">
  <!-- this is the description of this file so the file
       itself can be cited as a source. Note it's not referenced
       anywhere, so I'm uncertain as to the value of it... -->
  <displayName>My Family Tree</displayName>
</sourceDescription>

<!-- now we're going to describe the census record. -->
<sourceDescription id="S1" about="http://ancestry.com/path/to/census/record">
  <!-- this is the description of the census record -->
  <displayName>1900 United States Federal Census Record for Joshua Amis</displayName>
  <mediator resource="/path/to/description/of/ancestry/dot/com"/>
</sourceDescription>

<!-- now you've got another source in there entitled "Interpretation by Sarah Green"
  and I have no idea what that is, but I'll include it here for the sake of 
  completeness. -->
<sourceDescription id="S2" about="???whatisthisdescribing???">
  <displayName>Interpretation by Sarah Green</displayName>
  <!-- this source (whatever it is) was derived from S1 -->
  <source resource="S1"/>
</sourceDescription>

<!-- okay, now I've got my extracted person, Joshua Amis -->
<person id="P2">
  <name>...Joshua Amis...</name>

  <source resource="S2"/>
</person>

<!-- and now my "working" conclusion of Joshua Amis... -->
<person id="P3">
  <name>...Joshua Amis...</name>

  <!-- and I want to reference P2 as a source, but I can't do it directly
       because the source reference MUST resolve to a source
       description according to the spec. So what I have to
       do is describe P2 with yet another source description,
       S3, and reference that. -->
  <source resource="S3"/>
</person>

<sourceDescription id="S3" about="P2">
  <displayName>Conclusion about Josh Amis Extracted From Sarah's interpretation of the 1900...</displayName>
  ...
</sourceDescription>

Okay, so (like it or not), that's how it would be done with the spec as it is right now.
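For anyone who wants to see the indirection spelled out, here is a rough Python sketch that walks from P3 back to the census record by following `<source resource="..."/>` references and local `about` pointers. The `<gedcomx>` wrapper element, the trimmed-down document, and the `provenance` helper are assumptions made for this example only; they are not part of the spec.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the flat example, wrapped in a hypothetical root element
# so that it parses as a single XML document.
DOC = """
<gedcomx>
  <sourceDescription id="S1" about="http://ancestry.com/path/to/census/record"/>
  <sourceDescription id="S2" about="???">
    <source resource="S1"/>
  </sourceDescription>
  <person id="P2"><source resource="S2"/></person>
  <person id="P3"><source resource="S3"/></person>
  <sourceDescription id="S3" about="P2"/>
</gedcomx>
"""

root = ET.fromstring(DOC)
by_id = {el.get("id"): el for el in root if el.get("id")}


def provenance(element_id):
    """Follow source references and local 'about' pointers, depth-first."""
    chain, seen, stack = [], set(), [element_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in by_id:
            continue  # already visited, or not an element in this file
        seen.add(current)
        chain.append(current)
        element = by_id[current]
        targets = [s.get("resource") for s in element.findall("source")]
        if element.tag == "sourceDescription":
            targets.append(element.get("about"))
        stack.extend(reversed([t for t in targets if t]))
    return chain


print(provenance("P3"))  # ['P3', 'S3', 'P2', 'S2', 'S1']
```

The walk reaches P2 only through the extra description S3, which is the hop that the "MUST resolve to a source description" rule forces.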

So back to the question at hand. There is no way to determine that P2 is an extracted conclusion of a single source. How do we need to modify the spec in order to make that determination?

The proposal I made above was to allow a new property of SourceDescription (named perhaps extractedConclusions) that was a list of references to persons in the file. So the way to determine if a person was a single-source extracted conclusion was to see if the person were referenced in an extractedConclusions list, like this:

<sourceDescription id="S2" about="???">
  <displayName>Interpretation by Sarah Green</displayName>
  <!-- this source (whatever it is) was derived from S1 -->
  <source resource="S1"/>
  <extractedConclusion resource="P2"/>
</sourceDescription>
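A consumer of the file could then answer the "is this person extracted?" question with a simple lookup. The following Python sketch assumes the proposed `extractedConclusion` element exactly as written above, plus a hypothetical `<gedcomx>` wrapper and a hypothetical `extracted_from` helper; it is an illustration of the idea, not a reference implementation.

```python
import xml.etree.ElementTree as ET

# Trimmed example using the proposed element; the root wrapper is assumed.
DOC = """
<gedcomx>
  <sourceDescription id="S1" about="http://ancestry.com/path/to/census/record"/>
  <sourceDescription id="S2" about="???">
    <source resource="S1"/>
    <extractedConclusion resource="P2"/>
  </sourceDescription>
  <person id="P2"/>
  <person id="P3"/>
</gedcomx>
"""


def extracted_from(root):
    """Map each referenced conclusion id to the descriptions that list it."""
    index = {}
    for description in root.findall("sourceDescription"):
        for ref in description.findall("extractedConclusion"):
            index.setdefault(ref.get("resource"), []).append(description.get("id"))
    return index


root = ET.fromstring(DOC)
index = extracted_from(root)
for person in root.findall("person"):
    person_id = person.get("id")
    role = "extracted" if person_id in index else "working"
    print(person_id, role, index.get(person_id, []))
# P2 extracted ['S2']
# P3 working []
```

Keeping a list per conclusion also makes it easy to spot the case raised earlier in the thread, where the same conclusion ends up listed under more than one source description.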

John's question still needs to be addressed, too: how does P3 cite P2 as a source given that a source reference MUST resolve to a source description and cannot resolve to a person?

And I want to know why "S2" is even needed. What purpose does it serve? Why not just use S1?

@nilsbrummond

And I want to know why "S2" is even needed. What purpose does it serve? Why not just use S1?

It looks to me like your S3 is the same as Sarah's S2.

The point it serves is there can be multiple interpretations. For example, the Ancestry.com source may come with its own provided extracted conclusions via the import of the record model or whatever. Then the researcher may be unhappy with that interpretation and create their own.

<sourceDescription id="S1" about="http://ancestry.com/path/to/census/record">
  <!-- this is the description of the census record -->
  <displayName>1900 United States Federal Census Record for Joshua Amis</displayName>
  <mediator resource="/path/to/description/of/ancestry/dot/com"/>
</sourceDescription>

<!-- now you've got another source in there entitled "Interpretation by Sarah Green"
  and I have no idea what that is, but I'll include it here for the sake of 
  completeness. -->
<sourceDescription id="S2" about="???whatisthisdescribing???">
  <displayName>Interpretation by Sarah Green</displayName>
  <!-- this source (whatever it is) was derived from S1 -->
  <source resource="S1"/>
  <extractedConclusion resource="P2"/>
</sourceDescription>

<!-- a second interpretation of the same record: the extracted conclusions
  imported from Ancestry.com, kept alongside Sarah's own interpretation. -->
<sourceDescription id="S3" about="???whatisthisdescribing???">
  <displayName>Imported Interpretation from Ancestry.com</displayName>
  <!-- this source (whatever it is) was derived from S1 -->
  <source resource="S1"/>
  <extractedConclusion resource="P3"/>
</sourceDescription>

@stoicflame
Member

The point it serves is there can be multiple interpretations. For example, the Ancestry.com source may come with its own provided extracted conclusions via the import of the record model or whatever. Then the researcher may be unhappy with that interpretation and create their own.

Okay, so you're saying that Sarah is describing the same source in a different way? So that would be a different description about the same source. S2 would look just like S1 with perhaps a different display name, e.g. "Sarah's Interpretation of 1900 United States Federal Census Record for Joshua Amis". That's fine. What got me confused is that she embedded it within the other source description, implying there was some relationship between the two other than that they were describing the same thing.

@EssyGreen

The point it serves is that there can be multiple interpretations. For example, Ancestry.com may come with its own extracted conclusions, via the import of a record model or whatever. Then the researcher may be unhappy with that interpretation and create their own.

Exactly so :)

Sarah is describing the same source in a different way?

I am interpreting the image copy source supplied by Ancestry but ignoring the Conclusions supplied with it in favour of my own Conclusions.

What got me confused is that she embedded it within the other source description, implying there was some relationship between the two other than that they were describing the same thing.

I embedded it within the source for the image copy to show that it was a source of my own creation derived from it. In some situations it might be fine to just have an all-in-one source/interpretation ... but suppose the source was vague and could be interpreted in different ways ... I would want the Ancestry source and 2 derivative sources, one for each interpretation - both of the derivatives are "derived from" the image copy.

@EssyGreen

John's question still needs to be addressed, too: how does P3 cite P2 as a source given that a source reference MUST resolve to a source description and cannot resolve to a person?

I agree but that is a problem with the model as it is at the moment isn't it? It's not a new problem I've introduced ... in my view a Person is a citable object so there isn't a problem.

@EssyGreen

@stoicflame - try this - and see embedded comments

<sourceDescription id="S0" about="Research undertaken by Sarah Green on behalf of John Jones">
  <displayName>Family Tree of John Jones</displayName>
</sourceDescription>

<sourceDescription id="S1" about="http://ancestry.com/path/to/census/record">
  <displayName>1900 United States Federal Census Record for Joshua Amis</displayName>
  <mediator resource="/path/to/description/of/ancestry/dot/com"/>
</sourceDescription>

<sourceDescription id="S2" about="Preferred interpretation of S1">
  <displayName>Interpretation of 1900 United States Federal Census Record for Joshua Amis by Sarah Green</displayName>
  <source resource="S1"/> <!-- *** HOW DO I SAY IT'S A DERIVATIVE OF S1 (NOT JUST SOMEHOW REFERENCES IT)? *** -->
</sourceDescription>

<person id="P2">
  <name>...Joshua Amis...</name>
  <source resource="S2"/>  <!-- *** HOW DO I SAY THAT THIS IS EXTRACTED FROM S2 (NOT JUST REFERENCING IT)? *** -->
</person>

<person id="P1">
  <name>...Josh Amis...</name>
  <!-- *** HOW DO I SAY THAT THIS IS INCLUDED IN S0 or S100? *** -->

  <!-- and I want to reference P2 as a source, but I can't do it directly
       because the source reference MUST resolve to a source
       description according to the spec. So what I have to
       do is describe P2 with yet another source description,
       S3, and reference that. *** I AGREE THAT IS NOT GOOD *** -->
  <source resource="S3"/>
</person>

<sourceDescription id="S3" about="P2">
  <displayName>Conclusion about Josh Amis Extracted From Sarah's interpretation of the 1900...</displayName>
  ...
</sourceDescription>

<sourceDescription id="S100" about="Research undertaken by Sarah Green on behalf of Bob Smith">
  <displayName>Family Tree of Bob Smith</displayName>
</sourceDescription>

@nilsbrummond

John's question still needs to be addressed, too: how does P3 cite P2 as a source given that a source reference MUST resolve to a source description and cannot resolve to a person?

I agree but that is a problem with the model as it is at the moment isn't it? It's not a new problem I've introduced ... in my view a Person is a citable object so there isn't a problem.

Every object with a Conclusion base should be treated as a source authored by the author of the GEDCOM-X.

I personally think the AnalysisDocument should be able to replace the interpretation sourceDescriptions in the last few examples, as long as the interpretation is by the GEDCOM-X author. I think of a SourceDescription as a description of an external source referenced; every Conclusion as a potential internal source; and every Conclusion as a potential source referenced by a different GEDCOM-X.

@EssyGreen

Every object with a Conclusion base should be treated as a source authored by the author of the GEDCOM-X.

+1 Absolutely :) .... but what if we import a Conclusion from Ancestry or elsewhere? Is this still a Conclusion or is it now an external source (albeit in the format of a Conclusion)?

I think of a SourceDescription as a description of an external source referenced; every Conclusion as a potential internal source

Yup I'd go along with that tho' I think that it's important to retain the context e.g. a Role shouldn't be cited out of context of its Event; a Person should retain the context of its Relationships etc (but I think it may be difficult/impossible to enforce this in the data structure).

the AnalysisDocument should be able to replace the interpretation sourceDescriptions in the last few examples, as long as the interpretation is by the GEDCOM-X author

I don't think the AnalysisDocument contains Conclusions and I would prefer to be able to keep the "extracted" Conclusions clean from other sources and distinct from the "compound/working" conclusion(s) if that were wished by the researcher. My reasoning is that I often find it necessary to go back and compare my working vs extracted Conclusions if there is a problem further down the line.

@stoicflame
Member

Hello everybody. I apologize for the neglect of this issue; it's gotten cold. If you'd be willing to push this back into working memory, I'd appreciate your help getting these issues addressed.

I'll pick this back up by addressing Sarah's pass at the XML that I put out there. It was helpful, thank you. I think I understand better how you're approaching the problem. You had some questions inline there that I'd like to address:

HOW DO I SAY IT'S A DERIVATIVE OF S1 (NOT JUST SOMEHOW REFERENCES IT)?

You're saying that it's a derivative already by referencing it as a source. That's what it means to reference a source.

HOW DO I SAY THAT THIS IS EXTRACTED FROM S2 (NOT JUST REFERENCING IT)?

Exactly the question this issue originally intended to address--how to distinguish "extracted conclusions" from "working conclusions".

My proposal is twofold:

  1. Extracted conclusions are referenced as such from the source description. This is what I meant above with the <extractedConclusion> element.
  2. A statement can be made that a working conclusion (e.g. "P3") is a conclusion about the same thing (e.g. person) as a particular extracted conclusion (e.g. "P2") by including an identifier for that thing on both conclusions.

Here's kind of what I mean:

<sourceDescription id="S2" ...>
  ...
  <extractedConclusion resource="P2"/>
</sourceDescription>

<person id="P2">
  <identifier>P2</identifier>
  <name>...Joshua Amis...</name>
  <source resource="S2"/>
</person>

<person id="P3">
  <!--P3 has an identifier "P2", perhaps of type "component" or something, 
       to specify that P2 and P3 are conclusions about the same person-->
  <identifier type="...Component...">P2</identifier>
  <name>...Joshua Amis...</name>
  <source resource="S2"/>
</person>
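
A minimal sketch of how a consumer might apply the two rules above, assuming a simplified, namespace-free serialization modeled on the example. The wrapping `<gedcomx>` root and the `Component` identifier type are placeholders for illustration only; the real type value is still undecided at this point in the thread.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modeled on the example above.
# ASSUMPTION: the <gedcomx> root and type="Component" are placeholders.
DOC = """
<gedcomx>
  <sourceDescription id="S2">
    <extractedConclusion resource="P2"/>
  </sourceDescription>
  <person id="P2">
    <identifier>P2</identifier>
    <source resource="S2"/>
  </person>
  <person id="P3">
    <identifier type="Component">P2</identifier>
    <source resource="S2"/>
  </person>
</gedcomx>
"""

root = ET.fromstring(DOC)

# Rule 1: a conclusion is "extracted" if some source description lists it.
extracted = {ref.get("resource") for ref in root.iter("extractedConclusion")}


def components(person):
    """Rule 2: ids of the extracted conclusions a person is built on."""
    return [
        ident.text
        for ident in person.findall("identifier")
        if ident.get("type") == "Component"
    ]


# Every person not listed by a source description is treated as "working".
roles = {
    person.get("id"): "extracted" if person.get("id") in extracted else "working"
    for person in root.iter("person")
}

# For each working conclusion, the extracted conclusions it points at.
working_links = {
    person.get("id"): components(person)
    for person in root.iter("person")
    if roles[person.get("id")] == "working"
}
```

With this fragment, P2 is classified as extracted (S2 lists it) and P3 as working, linked back to P2 through its identifier. Note that the role is never stored on the person itself; it is derived from the source description.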

@nilsbrummond

So extractedConclusion is, in the terms of EE, information extracted from a source for use as evidence. This evidence evaluated on its own is the atomized extracted-conclusion. A conclusion is dependent on the analysis of all relevant extracted-conclusions.

Let's not oversimplify, but be sure all the elements needed for GPS are there. I added some I thought should be there...

I believe the name element should have its own set of evidence analysis fields as well, but left them out to keep it cleaner...

Do Conclusions and ExtractedConclusions have the same attributes, behaviors, and rules? If not then I think we may need separate classes for each.

<sourceDescription id="S2" ...>
  ...

  <!-- EE inside front cover: source form -->
  <sourceForm value="original | derivative" />

  <extractedConclusion resource="P2"/>
</sourceDescription>

<person id="P2">
  <identifier>P2</identifier>
  <name>...Joshua Amis...</name>

  <!-- EE inside front cover: informant's degree of knowledge -->
  <information value="primary | secondary" />
  <!-- EE inside front cover: evidence adequacy to answer question.  An extractedConclusion can not be negative evidence by itself.  -->
  <evidence value="direct | indirect" />

  <source resource="S2"/>
</person>

<person id="P3">
  <!--P3 has an identifier "P2", perhaps of type "component" or something, 
       to specify that P2 and P3 are conclusions about the same person-->
  <identifier type="...Component...">P2</identifier>
  <name>...Joshua Amis...</name>

  <confidence value="..." />
  <proofStatement>...</proofStatement>

  <!-- EE inside front cover: evidence adequacy to answer question. 
        Evidence can only be "negative" when compared against other evidence.  So
        there must be an evidence type in the conclusion -->
  <source resource="S2" evidence="direct | indirect | negative" />
</person>

@EssyGreen

HOW DO I SAY IT'S A DERIVATIVE OF S1 (NOT JUST SOMEHOW REFERENCES IT)?

You're saying that it's a derivative already by referencing it as a source. That's what it means to reference a source.

I think you misunderstand my meaning ... say I come across a source (S1) which is a transcription of an original. I first log S1 but then follow it up and get an image copy of the original (S2). I want to be able to say that S1 is derived from S2. This will then allow me to preserve both copies and potentially explain any anomalies, problems etc with the transcription.

@stoicflame
Member

I added some I thought should be there...

Thanks! My comments:

  • Re: sourceForm: why do we need this? Can't we tell the source form by looking at the source? If it's an abstract, compilation, transcript, etc., it's a derivative source. If it's "material in its first oral or recorded form" then it's an original source.
  • Re: information: umm... maybe.... But I'm having a hard time coming up with a use case for needing it. And why is it in the person element? Why not on the source description element?
  • Re: evidence: Yes, I think this is needed on the source reference as an attribute (as you have it). I think your suggestion might be a good solution to #127 (support for modeling "negative" statements), so we'll address it there. But I'm struggling to understand how it makes sense in a source description element since (as you mention) it's only relevant in the context of a research question and compared to other evidence.
  • Re: confidence: Agreed. I think it needs to be moved out of attribution and onto the conclusion. We'll address that at #192 (What exactly is "Attribution" for, and what Classes need one?).
  • Re: proofStatement: I think we said that an AnalysisDocument serves as the proof statement for a given conclusion. So to provide a proof statement for a (set of) conclusion(s), you reference an analysis document as a source. So I don't think this element is needed within a conclusion.

@stoicflame
Member

I want to be able to say that S1 is derived from S2.

Umm... but it's not derived from S2. You said it was transcribed from an original. And S2 is also derived from that same original. So you need another description of the original (S0) and both S1 and S2 would reference S0 as a source.

If you wanted to create a transcription of the image copy of the original (i.e. a transcription of S2), then your description of that transcription would reference S2 as a source.

What am I missing?

@stoicflame
Member

I've attached the discussed changes to this thread and I'm awaiting your comments. In summary, the changes include:

  • Adding an extractedConclusions property of type URI to SourceDescription.
  • Adding an additional identifier type, http://gedcomx.org/Evidence, to identify the evidence extracted from sources that supports the conclusion.
  • Adding an example to the conceptual model that illustrates how to model extracted evidence.
  • Updating the UML and the Java code.
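
As a rough illustration of how the first two changes fit together, here is a small consistency check, assuming a simplified, namespace-free XML form. The `<gedcomx>` root, the singular `<extractedConclusion>` element name (taken from the earlier examples in this thread), and the `unresolved_evidence` helper are assumptions for the sketch, not part of the attached changes.

```python
import xml.etree.ElementTree as ET

# Identifier type named in the summary above.
EVIDENCE = "http://gedcomx.org/Evidence"

# ASSUMPTION: simplified serialization; P9 is a deliberately dangling id.
DOC = """
<gedcomx>
  <sourceDescription id="S2">
    <extractedConclusion resource="P2"/>
  </sourceDescription>
  <person id="P2">
    <identifier>P2</identifier>
  </person>
  <person id="P3">
    <identifier type="http://gedcomx.org/Evidence">P2</identifier>
    <identifier type="http://gedcomx.org/Evidence">P9</identifier>
  </person>
</gedcomx>
"""


def unresolved_evidence(xml_text):
    """Return (person id, evidence id) pairs whose evidence id is not listed
    as an extracted conclusion by any source description."""
    root = ET.fromstring(xml_text)
    extracted = {e.get("resource") for e in root.iter("extractedConclusion")}
    problems = []
    for person in root.iter("person"):
        for ident in person.findall("identifier"):
            if ident.get("type") == EVIDENCE and ident.text not in extracted:
                problems.append((person.get("id"), ident.text))
    return problems


PROBLEMS = unresolved_evidence(DOC)
```

Here P3 cites P2 and P9 as evidence; only P2 is actually listed by a source description, so the check reports the P9 reference.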

@EssyGreen

I want to be able to say that S1 is derived from S2.

Umm... but it's not derived from S2. You said it was transcribed from an original. And S2 is also derived from that same original. So you need another description of the original (S0) and both S1 and S2 would reference S0 as a source.

S1 says it's a transcription of the original (ie it is derived from S2 which at this point is not yet seen/assessed/evaluated). Although theoretically you are correct that S1 may have been made up, or transcribed/translated from another source, etc., until I see the original for myself I have to believe the details provided by the author/provider of S1.

So I get the original and take a scan ... that's my S2. Why would I create yet another source? It is the one that S1 said it was transcribed (i.e. derived) from. If I create an S0 then I am assuming that S1 was transcribed from some other source than the one it said it was!

@stoicflame
Member

S1 says it's a transcription of the original (ie it is derived from S2 which at this point is not yet seen/assessed/evaluated).

Wait, that statement is in conflict to me. Is it derived from the original, or is it derived from a copy of the original? I'm good either way, just choose so I can tell you how to model it.

If it's derived from the original, then you need to describe the original (i.e. create an S0) and reference the original as a source.

If it's derived from a copy of the original, then you need to describe the copy (i.e. S2) and reference S2 as a source.

What I'm really trying to do is identify how you think we're not fully accounting for the notion of "derived from".

@EssyGreen

S1 says it's a transcription of the original (ie it is derived from S2 which at this point is not yet seen/assessed/evaluated).

Wait, that statement is in conflict to me. Is it derived from the original, or is it derived from a copy of the original?

As a researcher finding the source, how would I know? I would trust that it came from the original but the transcriber may easily have been copying from fiche or whatever.

However, all this is somewhat irrelevant ... what I want as a researcher is to be able to make an explicit and unambiguous relationship between S1 and S2 which is the equivalent of saying "S1 is a derivative of S2". As far as I can tell the only means I have of getting close to that in GEDCOMX is by using the generic "sources" list to cross-reference them. Since this is a generic list it is ambiguous.

Similarly, in other situations, I want to be able to make an explicit and unambiguous statement that is the equivalent of saying "S1 is a component part of a larger source (collection) S3". Again in GEDCOMX the only way to do it is to use the generic sources list ... and hence again this leads to ambiguity.

If I look at the sources list for any particular source then it is impossible to deduce any meaning from it except that they are somehow related. In my opinion this renders it useless.

@stoicflame
Member

As a researcher finding the source, how would I know? I would trust that it came from the original but the transcriber may easily have been copying from fiche or whatever.

Oh, so you're identifying a third case: you don't know where it came from. That's fine. So why do you want to say that you do know where it came from? Just leave the sources list empty, for now, until you know where it did come from at which point you model its source.

I'll tell the story. A researcher finds a transcription and describes it with S1. She doesn't know where it's derived from, so she leaves the source list empty. Later, she finds an image that she describes with S2 and she decides either (a) the transcription was derived from the image or (b) the transcription was derived from the original just like the image was derived from the original. If (a), she modifies S1 to reference S2 as a source. If (b), she describes the original with S0 and references it from the source list of both S1 and S2.
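
The three states of that story can be sketched as data, assuming a simplified, namespace-free XML form with a hypothetical `<gedcomx>` root. The `derivation_roots` helper is illustrative only: it follows `<source resource="..."/>` links upward and reports the descriptions that cite nothing further.

```python
import xml.etree.ElementTree as ET

# Before the image is found: S1's source list is empty.
UNKNOWN = '<gedcomx><sourceDescription id="S1"/></gedcomx>'

# Case (a): the transcription S1 was derived from the image S2.
CASE_A = """
<gedcomx>
  <sourceDescription id="S1"><source resource="S2"/></sourceDescription>
  <sourceDescription id="S2"/>
</gedcomx>
"""

# Case (b): S1 and S2 were both derived from the original, S0.
CASE_B = """
<gedcomx>
  <sourceDescription id="S0"/>
  <sourceDescription id="S1"><source resource="S0"/></sourceDescription>
  <sourceDescription id="S2"><source resource="S0"/></sourceDescription>
</gedcomx>
"""


def derivation_roots(xml_text, start_id):
    """Follow source links upward from start_id and return the ids of the
    descriptions that cite no further source."""
    root = ET.fromstring(xml_text)
    sources = {
        sd.get("id"): [s.get("resource") for s in sd.findall("source")]
        for sd in root.iter("sourceDescription")
    }
    roots, seen, stack = set(), set(), [start_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental reference cycles
        seen.add(current)
        parents = sources.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)
    return roots
```

In the unknown state S1 is its own root; in case (a) it traces back to S2, and in case (b) both S1 and S2 trace back to S0.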

what I want as a researcher is to be able to make an explicit and unambiguous relationship between S1 and S2 which is the equivalent of saying "S1 is a derivative of S2". As far as I can tell the only means I have of getting close to that in GEDCOMX is by using the generic "sources" list to cross-reference them. Since this is a generic list it is ambiguous.

I disagree. It's not ambiguous at all. You make that clean and unambiguous statement by stating that the source of the transcription described by S1 is the source described by S2.

Similarly, in other situations, I want to be able to make an explicit and unambiguous statement that is the equivalent of saying "S1 is a component part of a larger source (collection) S3".

Thank you for articulating the other case that you've got in mind.

Again in GEDCOMX the only way to do it is to use the generic sources list ... and hence again this leads to ambiguity.

This is incorrect. When a researcher wants to describe a "component of" relationship, the source description provides a componentOf property (which is of type SourceReference) to reference the description of the source (i.e. S3) of which S1 is a component. So the "component of" reference does not belong in the sources list. Does that help relieve some of your concerns?
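
A minimal sketch of the distinction, assuming a simplified, namespace-free XML form in which the componentOf property is serialized as a `<componentOf resource="..."/>` element. That serialization, the `<gedcomx>` root, and the `relationships` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# S1 is derived from S2 and is a component of the collection S3.
# ASSUMPTION: element names mirror the property names used in this thread.
DOC = """
<gedcomx>
  <sourceDescription id="S1">
    <source resource="S2"/>
    <componentOf resource="S3"/>
  </sourceDescription>
  <sourceDescription id="S2"/>
  <sourceDescription id="S3"/>
</gedcomx>
"""


def relationships(xml_text, source_id):
    """Report the two kinds of reference separately for one description."""
    root = ET.fromstring(xml_text)
    for sd in root.iter("sourceDescription"):
        if sd.get("id") == source_id:
            container = sd.find("componentOf")
            return {
                "derived_from": [s.get("resource") for s in sd.findall("source")],
                "component_of": (
                    container.get("resource") if container is not None else None
                ),
            }
    raise KeyError(source_id)


S1_RELATIONS = relationships(DOC, "S1")
```

Because the two references live in different places, a reader of S1 can tell "derived from S2" apart from "part of S3" without guessing.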

@EssyGreen

Oh, so you're identifying a third case: you don't know where it came from.

Oh jeez this is really getting a bit too pedantic. Look, whenever I get a source I read it to see where the supplier/publisher said it came from. But I don't know it for a fact unless I do the legwork myself. In the same way I can't be sure they did a good job of transcribing it until I see the original. But in spite of this I want to be able to document it so that I can follow it up.
Your example doesn't mean anything to me I'm afraid. Let's just forget the whole thing.

When a researcher wants to describe a "component of" relationship, the source description provides a componentOf property (which is of type SourceReference) to reference the description of the source

That's great - I hadn't realised that was in there.

@stoicflame closed this Oct 5, 2012
@jralls mentioned this pull request Mar 29, 2013
6 participants