Generation mode #1146

ampli · 2021-02-27T17:22:52Z

ampli
Feb 27, 2021
Collaborator

Previous discussion on that at these places:

PR "count.c fixes" starting at count.c fixes #1143 (comment).
Discussion thread "Metatopic: Next Steps, Strategic Plan", starting at Metatopic: Next Steps, Strategic Plan #1134 (comment).
(I will append here more places if/when I locate them.)

ampli · 2021-02-27T17:33:06Z

ampli
Feb 27, 2021
Collaborator Author

FYI, I just now started a command-line tool for generation; it's in my repo, branch "generate"

How can we maintain it?
Should I send you PRs, or just push commits to my branch and you periodically pull from it?

Also, I would like that it would be based on the master branch, so all library future improvements will get to it, and also so we will be able to merge it eventually with the master branch without conflicts. It means every PR to the master branch should also be applied to the "generate" branch (even if it is totally unrelated, like applied to link-parser).

0 replies

ampli · 2021-02-27T18:03:04Z

ampli
Feb 27, 2021
Collaborator Author

// Force the system into generation mode by appending "-generate"
// to the langauge. XXX this seems hacky, need a better API.

Originally, I intended to trigger the "generating" mode using "-test=generate". However, I then noted that in the PR for the improved panic mode I damaged the ability to set the test parse-option (and the debug and verbosity parse-options too) before opening the library (fixed later by PR #1144). I then used the "dict-generate" hack since I couldn't find another way to pass an indication for that to the dictionary-reading code.

Suggestion for a less-hacky API for that:

Use -test=generate. I introduced the -test argument for passing experimental options to the library, without a need to add a zillion of additional parse-options.) This can later be converted to a parse-option if desired (but there may no need for that).
Use a new parse-option generate, which may have Boolean value, or may be better - a string value that can be set to sub-options. But unless (3) is implemented, it will need to set a global (possibly thread-local) variable.
Add something like dictionary_create_with_options(), that well get Parse_Options as 3rd argument. Or maybe better yet, a new options type Dictionary_options, that will allow replacing all the other dictionary_create_*() calls.

I propose to use option (1) for now.

0 replies

ampli · 2021-02-27T18:13:27Z

ampli
Feb 27, 2021
Collaborator Author

// Set the number of words in the sentence.
// XXX this is a hacky API. Fix it.

Suggestion for a better API:

Curreently, in generation mdoe, sentence_create() uses the number provided in the "sentence" to create a new sentence that contain the requested number of wildcard-words (represented as \*). Instead, generate.c can do it by itself.
Use -test=generate:N.

I propose to use option (1). It is also somewhat more flexible in case you would like to generate sentences with certain words, like only sentences ending with a period.

6 replies

ampli Mar 10, 2021
Collaborator Author

John told Mary that * * * on the *

Exactly. This would work.

In the old demo (on which I wrote in the discussion group), something like ... on the \*^Cn-u (when ^C is SUBSCRIPTMARK or even ... on the w\* was possible too, since I used dictionary_lookup_wild() for words containing *'.
The current generate code inherited it, and it includes this call in case a word includes `*' (but is not only '*'). However, it cannot work now without further modification (likely small ones) since the rest of the code deals only with categories.

linas Mar 10, 2021
Maintainer

Yes, something like that would be nice. Possibly something like ^C<UNKNOWN-WORD>.v or .. I dunno. Some way of limiting the selection to just nouns or just verbs. I'm not working on sentence generation right now, so something, anythhing, a place-holder, will do for now. The long-term vision would be to be able to accept, as input, some structured information in MTT form ("Meaning-Text Theory") and generate grammatically valid sentences as output.

If you want to work on this, we can talk more. The right way to start would be a top-down design: figure out how to take MTT-style input and convert it into what LG can work with.

ampli Mar 11, 2021
Collaborator Author

Yes, something like that would be nice.

II don't see a problem to restore this previous behavior.

If you want to work on this, we can talk more.

There is a need to have a feature-annotated English word dictionary.

ampli Mar 11, 2021
Collaborator Author

I have just modified link-generation.c to generate a sentence pattern using wildcard words according to the given sentence length. I would also like to add the ability to set the sentence pattern to an arbitrary one.

Possibilities:

On -l 0, read a sentence pattern from stdin (involves fixing the argument help text to say that).
New argument -p (--pattern) with a sentence pattern as an argument (ignoring -l in that case).

(1) is simpler, but maybe hackish.
Of course, implementing any of these options is easy, so I don't care - just please tell me what you prefer.

ampli Mar 12, 2021
Collaborator Author

New argument -p (--pattern)

Or maybe -t (--template)

linas · 2021-03-09T02:56:18Z

linas
Mar 9, 2021
Maintainer

Continuing discussion from pull req #1143

more than one random sampling of each "category"

Oh, OK. Interesting. One is enough, but it should not be a "random sampling", it should really output a category label. This allows some later stage to pick the right word from that category: e..g. by uniform random sampling, Zipfian random sampling, or by using some external info to select a specific "synonym" in that category. The tricky question is "how to print that category label". I can avoid this question in the short-term, by using dictionaries that have only one unique word per category.

how to send fixes and maintain the generate branch.

I'll move the "generate" branch over to this repo. Then I think you can do pull reqs on it (I guess, I've never done that before). Another possibility is you can pull my generate branch and apply fixes to that, and we can ping-pong in this way. I don't really care; whatever seems easiest for you.

8 replies

ampli Mar 11, 2021
Collaborator Author

Well, I guess if it was json format, that would be OK...

There is no problem to provide that. This way the disjuncts (including their multiplexed costs) can be easily included too in case there is a need to use them.

the intent is to eventually merge the branch into master

Hence I would like to pedantically keep it identical as possible to master. If they become diverse, their merging could be a nightmare.

As I said, I would like to force-push a master rebased version soon, and I will notify you when to make the pull -f for it (you can "back up" your local branch beforehand, just in case, by applying a tag).

ampli Mar 11, 2021
Collaborator Author

I'll move the "generate" branch over to this repo. Then I think you can do pull reqs on it (I guess, I've never done that before). Another possibility is you can pull my generate branch and apply fixes to that, and we can ping-pong in this way. I don't really care; whatever seems easiest for you.

As a test I made a PR for branch generate that contains a single commit from master (0644747) that is not yet in it.
Now I have these options to get it into generate:

Merge
Squash as a single commit
Rebase

Updating by rebasing is ideal for me, but you will need to pull -f it.
Merging is fine too, but I think we may get numerous extra merge records when we finally merge generate into master.
Adding all commits (in general - in this particular test there is only one commit) seems fine to me, but I don't know if it would not affect the commit authoring info (I guess it would not because this code already appears in master, but I have never tested that).

So the question is which option to use.

ampli Mar 11, 2021
Collaborator Author

Since I'm going to do a force-push of my initial fixes, rebased on master, would it be OK with you that I also arrange/collapse the commits to have a cleaner history, that also don't include bad/hacky tries?

ampli Mar 12, 2021
Collaborator Author

Squash as a single commit

I applied it this way. As expected, it got applied only to the generate branch. My name was added as a commiter but I guess this metadata change would not be applied to master on a merge (because this commit is already there).

linas Mar 14, 2021
Maintainer

1 Merge
2. Squash as a single commit
3. Rebase

I mildly prefer 1. but only because it preserves history (mistakes, bad ideas and all). Merging preserves all authorship, all date-stamps, everything. It seems easy and uncomplicated to me. But if you really want to/need to do it some other way, I guess that's OK.

ampli · 2021-03-10T03:12:59Z

ampli
Mar 10, 2021
Collaborator Author

There is now a branch opencog/link-grammar named generate

First thing, it is needed to rebase it on master. Currently, I don't have any idea how to do that ... .
We will then need to rebase it on master whenever we apply new commits to master. I have no idea how to handle conflicts in that case.

3 replies

ampli Mar 10, 2021
Collaborator Author

There is now a branch opencog/link-grammar named generate

First thing, it is needed to rebase it on master. Currently, I don't have any idea how to do that ... .
We will then need to rebase it on master whenever we apply new commits to master. I have no idea how to handle conflicts in that case.

I fetched the new generate branch (replacing my local one).
By now there are many changes in master that are not in it.

Here is the workflow I propose:

To start "cleanly", I propose that I force-push a one-time rebase on master, and then you make a one-time git pull -f to get it - I suppose you don't have local changes just now.
(Before doing a force-push, I will push a commit to generate to validate that my setup is correct.)

From that time on, I will take care to apply the master commits also to my local generate branch, and commit to the repository's generate` branch if I needed to integrate them.

There are some changes in the generate branch that it is a good idea to apply to the master just now. For even a cleaner start I will create PRs for them so you can apply them to master before I make the initial rebase I propose above.

I will try to not make conflicting changes in master until we decide to merge the generate branch with it(besides urgent bug fixes if needed).

Does all of that seem fine to you?

linas Mar 14, 2021
Maintainer

I merged master into generate a few days after you wrote the above. I can see that you found this, and are using it.

I don't see any particular reason to restart cleanly. I am not worried about merge conflicts; they can sometimes arise, but are usually quite easy to fix. As far as I'm concerned, both master and generate can evolve independently, and I can merge master into generate every now and then, whenever it seems needed.

ampli Mar 14, 2021
Collaborator Author

OK.

ampli · 2021-03-10T03:48:44Z

ampli
Mar 10, 2021
Collaborator Author

I said:

The stability problem can be solved as follows:
Along with the generated sentence output (using categories labels), the program would also produce a directory of files whose names are categories and their content are words belonging to these categories. An external program would then use a word-frequency table to generate the words.

However, it turned out it is not that simple, since the disjunct eliminating code (in the generate branch) also multiplexes many categories (and drastically reduces the number of disjuncts by that). So the random word plugging first involves randomly selecting a category within a disjunct, and only then randomly selecting a word within a category. This makes it more complex (but of course it is still possible) to implement code that forces desired word distribution.
Also, should the sampling be depended on costs? (Disjunct costs are not implemented yet in the generate branch.)

28 replies

ampli Mar 20, 2021
Collaborator Author

both dict-api.h and dict-structures.h are installed. So the new API should go into these two files.

I encountered a problem when trying to include these files in link-generate.c:
They include other files with paths that are correct only when they reside in the system include directory under link-grammar, or when CPPFLAGS of link-generate.c is similar to the one used for library compilation. It means that I would need to use this CPPFLAGS and the include paths will not be the same as used to include the system files.

linas Mar 20, 2021
Maintainer

However, dictionary_lookup_wild() is not find in link-grammar.def. ....not exported on Windows. ...

Right now, I don't care about these inconsistencies. They can stay that way longer. What was needed was a way for the opencog NLP subsystem to get access to the LG Exp structures. There are no other users, which is why I don't care about them. The opencog NLP subsystem uses the LG Exp structures to try to generate sentences ("sureal" and "microplanner")

linas Mar 20, 2021
Maintainer

Shouldn't category_name be just name?

Yes.

I encountered a problem when trying to include these files

What if the includes were changed to use angle-brackets instead of quotes? e.g. #include <dict-structures.h> instead of #include "dict-structures.h" ?

ampli Mar 20, 2021
Collaborator Author

Right now, I don't care about these inconsistencies. They can stay that way longer. What was needed was a way for the opencog NLP subsystem to get access to the LG Exp structures. There are no other users, which is why I don't care about them. The opencog NLP subsystem uses the LG Exp structures to try to generate sentences ("sureal" and "microplanner")

My current implementation has a big problem in providing useful expressions: These are expressions as they appear in the dict, i.e. with unresolved symbolic dialect components. (It also means that old projects that use dict expressions would misbehave and even may crash if they would use a dict with dialects.)

I see two possibilities to solve that:

As I said in Generation mode #1146 (reply in thread), quoted below:

Exp already is in the public

Note that they may contain dialect components and there is no public API to convert them to fixed costs.
Something like this: lg_exp_set_dialect(Exp *e, Parse_Options opts).

It should actually be:

Exp *exp_with_dialect_resolved = lg_exp_set_dialect(const Exp *e, const Parse_Options opts);
void free_Exp(Exp *e);

Make the API call Category* dictionary_get_categories(Dictionary dict); to return a copy of the Category array, in which the expression dialect is resolved.

I propose to use possibility (1), as it is much easier to implement and is more useful and flexible. It is also much more efficient (no need to pre-resolve all the expressions, no need to provide a copy).

ampli Mar 20, 2021
Collaborator Author

What if the includes were changed to use angle-brackets instead of quotes? e.g. #include <dict-structures.h> instead of #include "dict-structures.h" ?

This doesn't help, as the needed include files contain references to other include files. This is working fine if all are in the same directory, like in the system include directory under link-grammar. It can be of course fixed but we will need to find out how to do that.

linas · 2021-03-14T04:16:00Z

linas
Mar 14, 2021
Maintainer

Dictionaries stored in sqlite3 don't seem to work. The data/demo-sql contains a dictionary that will parse this is a test so I expect at least that to be generated. But it isn't: I tried link-generator -l data/demo-sql -s 5 -c 50 (and also 4 and 6 and 7) and get nothing. So ???

5 replies

ampli Mar 14, 2021
Collaborator Author

The category data structure is being built in read-dict.c, and thus it is blind to the content of a sqlite3 dict.
For a sqlite3 dict, there are these possibilities:

To dump it to a file dict.
To add a function that iterates the sqlite3 dict and builds the category data structure.

I guess that option (2) is desired. But I don't know anything about the structure of the sqlite3 dict.
Questions:

Is it also organized according to word categories?
Do you have C code to convert it to a file dict (so I can reuse it for building the category data structure).
Is there up-to-date documentation for it?

ampli Mar 14, 2021
Collaborator Author

BTW:
I just now fixed a bad bug that caused the category word list (for each category) to get truncated.
I found it after I implemented the JSON category dump.

This was the bug fix:

--- a/link-grammar/dict-file/read-dict.c
+++ b/link-grammar/dict-file/read-dict.c
@@ -1884,11 +1884,11 @@ static bool read_entry(Dictionary dict)
                dnx->exp = n;
                i++;
        }
-       dict->insert_entry(dict, dn, i);
-
        if (IS_GENERATION(dict))
                add_category(dict, n, dn, i);
 
+       dict->insert_entry(dict, dn, i);
+
        if (dict->suppress_warning)
        {
                free((void *)dict->suppress_warning);

But the word selection still doesn't look like a random one at all. So there is at least one more bug in this regard. I will update the GitHub branch after I solve that (hopefully tomorrow).

linas Mar 14, 2021
Maintainer

I can "easily" dump my data for a file; I think I need to write that code anyway, for other reasons.
I think the sqlite backend is already organized according to "categories" (assuming we agree about the definition of a "category", in comments above. Conceptually, the sqlite data layout is "almost identical" to the flat file layout. So making it work wouldn't be that hard.
If my description in Generation mode #1146 (reply in thread) is correct, then ... I think its smooth sailing. With those 4-5 API functions, everything becomes "easy", I think.

ampli Mar 14, 2021
Collaborator Author

I'm happy to merge generate...ampli:generate as it stands right now.

I will push to my repository a new one in a few minutes.
In addition to bug fixes and other improvements, there is also a possibility to dump the dict categories in a readable JSON format::

echo / | $LGT/link-parser/link-generator -l en -s 0 --test=dump-categories

-s 0: Read a sentence template from stdin.
echo /: A bogus sentence template with no linkages, just to get only the dump-categories result.

BTW, python -m json.tool says on the produced output:
Invalid \uXXXX escape: line 11 column 10 (char 204)
I don't know why.

linas Mar 14, 2021
Maintainer

json

I have no plans for using json for anything ...

sql

If you look at link-grammar/dict-sql/dict.sql then you will see that the sql dict already uses word categories; they are given explicit names: classname. I will try to add generation support to the sql backend now. I think it will be easy.

I will push to my repository

Open a pull request whenever you are ready. I don't want to guess when that might be.

ampli · 2021-03-14T06:28:58Z

ampli
Mar 14, 2021
Collaborator Author

I can "easily" dump my data for a file; I think I need to write that code anyway, for other reasons.

So I guess I can forget it for now.

2 replies

ampli Mar 14, 2021
Collaborator Author

Reading the code, I see that you have a typedef category_and_cost and I guess we could use that in the API instead of what I proposed above.

It is possible, but then we expose the API user to internal implementation details.

ampli Mar 14, 2021
Collaborator Author

json

I have no plans for using json for anything ...

So I should remove it before I send the next pull request. Please tell me if this is desired.
However, it seems useful for "something".
Here is the output for the en dict: en-categories.zip

Open a pull request whenever you are ready. I don't want to guess when that might be.

I will send a PR after you decide whether I should remove this code (for that I just need to discard the last commit).

ampli · 2021-03-15T00:43:47Z

ampli
Mar 15, 2021
Collaborator Author

(Starting a new thread for that topic.)

Given a word, return a list of all word-categories it belongs to.

A word can indeed appear in several categories, especially with different subscripts.
Say you know that its frequency in the language (in its form w/o subscript, of course) is X percent.
You fetch a linkage in which one of the disjuncts is of a category to which this word belongs (and this disjunct is also derived from several other categories).
How can you calculate the chance of this word to represent this disjunct (and thus emitted in the output sentence)?

5 replies

linas Mar 15, 2021
Maintainer

Good question. You might not like the answer, which is that probablilities are proportional to exp(-cost) (or 2^{-cost} if cost is meaasured in bits instead of nats.) I don't know what the frequency of a word is in a language, for several reasons:

I don't bother to actually count
Because knowing this number has never seemed useful
It's highly dependent on the speaker, the corpus: e.g. frequency of obscure words is much higher in a technical manual, than in common speech. And, as I found out, words like "walk, run, jump, hit shoot, ride" are very (very) rare in wikipedia, even when they are common in outdoor activity.
So in the above sense, word frequency is "not knowable", because there is no such thing as an average or typical corpus which one can sample.

However, what I do know, to fairly good precision, is the ratio of different disjuncts for a given word. For any given word, I do count how many times each different disjunct is observed. This is central to the learning process. This raw count is not useful by itself; it has to be processed to obtain the mutual information (MI) between a word and a disjunct. The MI is interpreted as being minus the cost, and so the "most correct" sentences are those with the highest total MI (least cost).

FYI, for word w and disjunct d, the MI(w,d) is

                  N(w,d) N(*,*)
MI(w,d) = log_2  ----------------
                  N(w,*) N(*,d)

where

N(w,d) = number of times disjunct d has been observed on word w
N(w,*) = sum over all dofN(w,d)`
N(*,d) = sum over all w of N(w,d)
etc.

Equivalently

                  P(w,d) 
MI(w,d) = log_2  ----------------
                  P(w,*) P(*,d)

where the P's are probabilities: i.e. P(w.d) = N(w,d) / N(*,*)

Thus, given a sentence and a parse (a sequence of (w,d) pairs) the MI of the sentence is the sum of the MI of each pair. The likelihood or "probability" of that sentence is proportional to 2^{total MI}

FWIW, individual MI's range from +30 to -10, form a bell curve with about +10 in the middle. ... they conform to the bell curve quite nicely, especially at the tails. Pairs with an MI below about +4 are "junk" or "nonsense" or "meaningless" (low-quality), while those with high MI are "very precise". Examples of high MI come from geographic place names, e.g. "Emek Ayalon Bridge" -- you will almost never see the word "Emek" by itself, it will almost always be followed by "Aylon Bridge" so the disjunct to match that will have a very high MI (very low cost) By contrast, nonsense sentences might be parseable, but will usually be filled with "junk disjuncts" and have a low or negative MI (very high cost)

Oh, and FYI, all this has been measured and works well for English (I've measured for disjuncts, other people have measured other variations over the last 2-3 decades). The goal of the current language learning project is to improve the overall accuracy by a factor of 10x. That is, the above "works well", it just doesn't "work really really well".

ampli Mar 15, 2021
Collaborator Author

A disjunct is typically derived from several categories, and first, you have to choose a category from them. According to your explanation, you know the chance of each disjunct to appear. What I particularly wondered is:

How do you know the chance of it being derived from a particular category (of the ones from which the said disjunct is derived)? I now understand you know that from your knowledge how many times this disjunct appears in each category, and the total number of disjuncts in each category. (EDIT: The counting is done in the corpus, not dict.). Is this correct?
Say you selected a particular category, and now you need to select a word within it. This is a particular meaning of the word, but the disjunct can be shared with other meanings of the word. How do you know the chance of this word to appear?

linas Mar 15, 2021
Maintainer

During learning, categories are learned. Initially, there are only pseudo-disjuncts; connector types are learned by clustering. For example, the corpus: "the dog saw the cat. The dog chased the cat." results in two psuedo-disjuncts:

saw:  dog- & cat+;
chased: dog- & cat+;

From this, one can deduce that "saw chased" belong in the same word category. Correctly clustering different word meanings is harder. Consider a corpus that contains "He will saw the wood". The psuedo connector-sets are now:

saw:  (dog- & cat+) or (he- & wood+);
chased: dog- & cat+;

It is no longer obvious that "saw chased" belong in the same category. It's now a judgement call. The current code base notices that there is an overlap, and if the overlap is large enough, it will subdivide the list of pseudo-disjuncts, clustering like so:

chased saw.a: dog- & cat+;
saw.b:  (he- & wood+);

If the corpus contains "he cut the wood", one can now deduce

chased saw.a: dog- & cat+;
saw.b cut:  (he- & wood+);

Determining when to merge, when to split is both algorithm specific, and are controlled by tunable parameters. The goal is to find the best algorithm the best parameter tuning.

Anyway, the above is for learning, not for generation. There are two very different generation modes. One is to generate a random corpus. In this case, I start by generating a random dict. Start with N words and C connector types (N and C are a user-settable parameter) Create random disjuncts of length 1,2,3... with some probability distribution (a user-settable parameter). Assign an average of A words to each disjunct (A is a parameter) etc. a few more parameters. Right now, this random-dict generator is not assigning any costs. Then, given such a random dict, create a random corpus of M sentences, (with some tuneable probability of short vs. long sentences.)

The initial goal is to generate a corpus that is "good enough" for initial testing and learning-pipeline bringup. What we have now is more-or-less good enough for that (and I have fully automated most of the pipeline). I would like to add costs, and make other tweaks, but first priority is to get learning working well with the current set of tuneable parameters.

The long term goal is to generate "random grammars that are just like English" but I don't yet know how to measure if a grammar is "just like English" (or "just like Russian", with morphology, or "just like Hebrew", with the things that it does.). If I can prove that (a) the learning system can accurately learn random grammars and that (b) the random languages are just like natural language X then I can claim that the learning system can accurately learn natural-language X.

So that's random generation. The other, completely different kind of generation, is to create sentences from a bag of words and features: e.g. create a valid sentence that contains the words "pizza, Ben, eat", make the verb be past-tense, and be sure to say that there were two pizzas (if possible) and that they were large (if possible). And, if you can keep the sentence short, then also say that this happened on a Sunday. Otherwise create two sentences. So this is the other kind of generation: we know approximately the kinds of topics to be discussed. An approximate priority for those topics. Some limitations about time and length (write a two page short story about Ben eating pizza; I can supply you with a network of interconnected facts about where, when, how all of this happened.) -- the trick is to find grammatical word-sequences. i.e. sentences that express this network of interconnected facts. This is the other kind of generation. ...

(There is some existing code that attempts to do this second kind of generation, but it does not work well. (for many reasons) and just right now, no one is actively working to fix it.)

ampli Mar 15, 2021
Collaborator Author

To sum up:

I will include in link-generate the "dummy" word-plugging code, to be modified by you.
I will think about "The other, completely different kind of generation" that you describe above.

ampli Mar 17, 2021
Collaborator Author

This is the other kind of generation. ...

As I said in another post here, I will need for that en English word features dictionary. Do you know about a free one?

(There is some existing code that attempts to do this second kind of generation, but it does not work well. (for many reasons) and just right now, no one is actively working to fix it.)

I intend to look at that after we merge the generation branch.

I also have an idea of a very fast generator for sentences with some fixed words and intervening variable wild-card words.
It needs more deep changes in the code, to the degree that it would not be efficient to use the same functions for regular linkage too. But most of the code would still be the same, so tricky C-preprocessor tricks would be needed to still use the same code (maintaining different code, when most of the code is the same, would be a nightmare)_.

Anyway, the idea is to first fully parse a long-enough sentence, wholly consisting of wild-card words. This may take much time and memory (and may be considered as some kind of initial learning or training). Then the connector table would be copied to a new one, without the zero count entries, and without the count fields, so its size would be very much smaller. The result connector table, along with the disjuncts of the wildcard words would then be dumped to a file, to be reread on program start. When a sentence template is then given, the program would prune these disjuncts according to the template disjuncts, and mk_parse_set() would perform recounting (this pruning is needed in order to preserve the original tracon numbers). More optimizations could be done, like renumbering the tracons Of course, without actually implementing it I cannot know if there are fatal holes in this design...

linas · 2021-03-15T04:30:55Z

linas
Mar 15, 2021
Maintainer

Bug: it seems that dict->num_categories is never initialized, and it seems that there is an off by one error: in add_category() you start at one; category number 0 is never used. (you respect this off-by-one in the free()) Can we start at zero?

10 replies

ampli Mar 15, 2021
Collaborator Author

This if statement doesn't seem to be correct ...

This is clearly a bug. Hence it happened on the word "a". I guess that "10" category was then due to bogus values seen by the incorrect union member.
The intention was to differentiate between disjunct of fixed words and disjuncts of categories.
An indicator should have been set earlier, in build_word_expressions().
A new disjunct field will be needed for such an indicator unless a hack is done and I still use string[0] for it. For now, my solution is to put there a blank, so the check will be if (string[0] != ' '). Later when the code stabilizes we can see if and where to create a new Disjunct struct field. I will send a PR right away so we can continue to develop the code on the same basis.

ampli Mar 15, 2021
Collaborator Author

I just sent PR #1161.
Note that the result of This is a test is w/o subscripts for en but with subscripts for demo-sql. I think this is because you don't set dict->leave_subscripts and it happens to have a random non-zero value.

ampli Mar 15, 2021
Collaborator Author

For me, IS_GENERATION is true, because I always load the categories

Note that currently if IS_GENERATION is true, the library cannot do normal parses because it skips doing things needed for them and instead do things needed for a generation.
For example, currently, it doesn't call sane_linkage_morphism(), and it also supposes there are no null links.
However, not doing the things that are needed for a normal parse is actually a bug if fixed sentence words are allowed in generation mode.
I will try to fix this bug.

For the generation mode to be always enabled, the categories should be loaded unconditionally, and IS_GENERATION should be changed to return true iff the sentence includes a wild-card \*. But as I said elsewhere here, this has the drawback of making a bad surprise if a normal sentence happened to include \*.

linas Mar 16, 2021
Maintainer

Always building the categories was scaffolding, while I wrote the code. I changed it to test in the same way that dict-file does.

linas Mar 16, 2021
Maintainer

I added the one-liner dict->leave_subscripts = test_enabled("leave-subscripts");

linas · 2021-03-15T05:27:52Z

linas
Mar 15, 2021
Maintainer

For each category, it seems you store the word, and not the subscripted word. Why? Does it matter? Is something easier, if done one way and not another?

4 replies

ampli Mar 15, 2021
Collaborator Author

I actually store the subscripted word. You see unsubscripted words in sentence output because by default I remove the subscript when I plug a word into a category. You can use --test=leave-subscripts when you invoke `link generator to prevent the removal of subscripts upon word plugging.

Should I remove the word plugging - since it will be done by link-generator? Or should it just be optional?

linas Mar 15, 2021
Maintainer

The word-plugging should be done in the generator.

ampli Mar 15, 2021
Collaborator Author

The word-plugging should be done in the generator.

OK, I will write the API and will move the word-plugging as-is to link-generator as an example of which data structures to use.

BTW, the "problem" is that almost surely some of the API (at least) will be a false-start and this may add many dummy commits until it gets stable. The word plugging will also need a total overhaul, resulting in more commits. Usually, you don't see such commits in the PRs I send, as I reorder and squash commits before I send a PR, else it would have included zillions of obfuscated commits. Especially, usually, I don't leave commits that have known buggy code - I just squash them with the fix. Else it may be harder to bisect.

linas Mar 15, 2021
Maintainer

I mostly don't care about buggy commits and false starts. When I code for myself, I don't bother to reorder and squash; it takes too much energy. For large projects, such as the Linux kernel, yes, having very clean and logical commits that appear in the right order is very important. But given the nature of this project, its just us two, I'm not concerned. Besides, maybe some future superintelligence will analyze the commit history to create a scientific theory of how human beings design code.

ampli · 2021-03-16T22:09:53Z

ampli
Mar 16, 2021
Collaborator Author

@linas,
I need your input regarding these issues to reduce the need for renames at a later stage:
#1146 (reply in thread)

2 replies

linas Mar 17, 2021
Maintainer

Done

linas Mar 17, 2021
Maintainer

p.s. I'm putting a thumbs up every time I think a conversation is probably finished. That means I will sometimes thumbs-up my own posts. But I'm not doing this consistently; mostly its so I can keep track of what I've already read vs. what I haven't read yet.

Feel free to continue any "finished" conversations. Feel free to use reactions in the same way.

I wish there were more reaction types, like a check-mark or something.

linas · 2021-03-17T04:03:39Z

linas
Mar 17, 2021
Maintainer

Bug: With the SQL-backed dict, I am seeing sentences with two or three left-walls in them, even though left-wall has no left-facing connectors in them ...

1 reply

linas Mar 17, 2021
Maintainer

But I cannot reproduce this bug with a flat-file equivalent dictionary... so it's something else.

ampli · 2021-03-17T04:13:38Z

ampli
Mar 17, 2021
Collaborator Author

You should not put them in the categories data structure. (For the dict file, they also don't include macros/regex labels and limit-directives). בתאריך יום ד׳, 17 במרץ 2021, 06:03, מאת Linas Vepštas ‏< ***@***.***>:

…

Bug: With the SQL-backed dict, I am seeing sentences with two or three left-walls in them, even though left-wall has no left-facing connectors in them ... — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1146 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGWXGRZ66GDZAZUKNJCJTTTEAS2VANCNFSM4YKEE45Q> .

3 replies

linas Mar 17, 2021
Maintainer

oh, OK. ... How did you guess?

linas Mar 17, 2021
Maintainer

.. I believe you, and I'll fix this tomorrow ... but how did that make a difference? Why would this mistake cause LEFT-WALL to appear two or more times?

ampli Mar 17, 2021
Collaborator Author

How did you guess?

This happened to me too before I excluded them...

If we would like to include anything (i.e. walls, length-limit directives, macros) in the category data structures, we will need a flag field to signify that such words should not be used for a wild-card.

Note that I still include idioms, but ~~post-processing~~ external processing would be needed to separate their words.

ampli · 2021-03-17T04:39:11Z

ampli
Mar 17, 2021
Collaborator Author

Why would this mistake cause LEFT-WALL to appear two or more times?

If you look at the disjuncts you will see these sentences still have a full linkage.

2 replies

linas Mar 17, 2021
Maintainer

Indeed they do. I learn something new every day.

linas Mar 17, 2021
Maintainer

For your (and my own) entertainment:

linkparser> LEFT-WALL a mouse saw a cat
Found 1 linkage (1 had no P.P. violations)
        Unique linkage, cost vector = (UNUSED=0 DIS=-13.15 LEN=10)

     +--------------------TI--------------------+
     +------------------TB-----------------+    |
     |           +------TI-----+           |    |
     |           +---TB--+--TC-+--TO--+-TE-+-TC-+
     |           |       |     |      |    |    |
LEFT-WALL.2 LEFT-WALL.2 a.1 mouse.5 saw.3 a.1 cat.5

(this is from a tiny test corpus so its "acceptable".)

linas · 2021-03-18T23:58:23Z

linas
Mar 18, 2021
Maintainer

Amir, you wrote:

This is the other kind of generation. ...

As I said in another post here, I will need for that en English word features dictionary. Do you know about a free one?

(There is some existing code that attempts to do this second kind of generation, but it does not work well. (for many reasons) and just right now, no one is actively working to fix it.)

I intend to look at that after we merge the generation branch.

I also have an idea of a very fast generator for sentences with some fixed words and intervening variable wild-card words.
It needs more deep changes in the code, to the degree that it would not be efficient to use the same functions for regular linkage too. But most of the code would still be the same, so tricky C-preprocessor tricks would be needed to still use the same code (maintaining different code, when most of the code is the same, would be a nightmare)_.

Anyway, the idea is to first fully parse a long-enough sentence, wholly consisting of wild-card words. This may take much time and memory (and may be considered as some kind of initial learning or training). Then the connector table would be copied to a new one, without the zero count entries, and without the count fields, so its size would be very much smaller. The result connector table, along with the disjuncts of the wildcard words would then be dumped to a file, to be reread on program start. When a sentence template is then given, the program would prune these disjuncts according to the template disjuncts, and mk_parse_set() would perform recounting (this pruning is needed in order to preserve the original tracon numbers). More optimizations could be done, like renumbering the tracons Of course, without actually implementing it I cannot know if there are fatal holes in this design...

but I can't find these remarks in the threads above. So;

We should probably start a new discussion thread just for this
Existing link-grammar compatible code is in the opencog cde base, under the name of sureal (surface realization) and microplanning I don't know how good this code is; its old. It might not be worth the effort to try to understand it; I dunno.
There are multiple deep issues with generation, and it would be best to outline the issues in general, discuss a modular design, figure out what the components are and what they should do. For example:

an idea of a very fast generator for sentences with some fixed words and intervening variable wild-card words.

My knee-jerk reaction is to say "hurrah!" because this sounds like it could be very useful. But then again, is it actually useful? Is it really needed? It sounds like something one might want to be able to do, but ... why? Life is short, this set of problems are complex enough that I don't want to spend energy designing something that is not really needed.

6 replies

linas Mar 19, 2021
Maintainer

There s weird browser bug or github bug. The link you gave:

#1146 (reply in thread)

goes to "Yes, something like that would be nice..." You used the word "deep", but its not in any comments (except in the one I just added) Closing and refreshing the window doesn't fix it. Very unusual.

ampli Mar 19, 2021
Collaborator Author

You need to click on "(reply in thread)"

linas Mar 19, 2021
Maintainer

You need to click on "(reply in thread)"

Right. But it doesn't go to the right place, it goes to some other comment (high-lighiting it and everything.)

ampli Mar 19, 2021
Collaborator Author

Indeed there is inconsistent behavior (a bug).
Here it is:

ampli 8 days ago Collaborator Author

Yes, something like that would be nice.

I don't see a problem to restore this previous behavior.

If you want to work on this, we can talk more.

There is a need to have a feature-annotated English word dictionary.

linas Mar 20, 2021
Maintainer

I needed to click on "view more". Long threads have a "view more" button, and the text is not downloaded until I click on it. Thus, search fails. And, apparently, the URL-to-comment also fails, until I hit "view more".

ampli · 2021-03-19T00:18:13Z

ampli
Mar 19, 2021
Collaborator Author

We should probably start a new discussion thread just for this

No problem.

0 replies

ampli · 2021-03-20T04:12:57Z

ampli
Mar 20, 2021
Collaborator Author

I finished implementing the API we talked about and updated link-generator.c to invoke dictionary_get_categories() and print its content on --verbosity=200 (for debug). I used const for the API call result and for the parameters.

It would return a null-terminated array.

Since it is an array of Category and not an array of pointers, I used an element with num_words == 0 to signify termination.

I still need to implement diagram printing in generation mode (printing disjunct-strings for category disjuncts).

I noted that the category words of demo-sql are subscripted with a dot, while the words in implementation of the dict file categories are subscripted with SUBSCRIPT_MARK. For now, I don't know what is more useful.

It seems we forgot the need to set the chosen disjunct word_string to the plugged-words to be able to print the diagram with them.
I propose:
int linkage_set_disjunct_info(Linkage linkage, WordIdx w, char *word_string, double cost); // Return 0 on success, -1 on failure.

My next step is to move compute_generated_words() from linkage.c to link-generator.c (or to the new file I created generator-utilities.c).

2 replies

linas Mar 20, 2021
Maintainer

SUBSCRIPT_MARK

That's a bug. (ad I don't see where it is)

diagram printing in generation mode

why is this needed?

ampli Mar 20, 2021
Collaborator Author

SUBSCRIPT_MARK

That's a bug. (ad I don't see where it is)

I'll try to find out.

diagram printing in generation mode

why is this needed?

Just for debug and inspiration.

ampli · 2021-03-20T04:38:12Z

ampli
Mar 20, 2021
Collaborator Author

I still need to implement diagram printing in generation mode (printing disjunct-strings for category disjuncts).

But it is not clear it is needed, as the API user can use my proposed linkage_set_disjunct_info() for that.

0 replies

ampli · 2021-03-20T08:40:25Z

ampli
Mar 20, 2021
Collaborator Author

I proposed:

int linkage_set_disjunct_info(Linkage linkage, WordIdx w, char *word_string, double cost);

The name is not good, as it said to set the word and cost in the linkage.
But the linkage has only a place to set the word, not the cost.
It may not be a good idea to set the cost directly in the disjunct, because this field is now a part of a union.
This can be solved by adding a cost array to Linkage_s.
Something like:

struct Linkage_s
{
        WordIdx         num_words;    /* Number of (tokenized) words */
        bool            is_sent_long; /* num_words >= twopass_length */
        const char *  * word;         /* Array of word spellings */
        float  *        cost;         /* Array of disjunct costs */
...
};

And the API will then be:

int linkage_set_word(Linkage linkage, WordIdx w, char *word);
int linkage_set_cost(Linkage linkage, WordIdx w, double cost);

1 reply

linas Mar 20, 2021
Maintainer

Right now, I see no reason to print diagrams in generation mode, or to set costs.

I'm using the generator in two ways: to generate n initial corpus from a random dict, and to generate a final corpus from the learned dict, to verify that what was learned matches the input. The actual corpus is boring, and neither the disjuncts, nor the diagram is needed, except maybe for debugging (except I have not needed to debug, and don't expect to.)

I'd recommend just skipping diagram printing. It's just adding pointless complexity.

linas · 2021-03-22T17:18:28Z

linas
Mar 22, 2021
Maintainer

I notice you were reading commits in 'learn'; have you tried running the code?

0 replies

ampli · 2021-03-22T22:30:51Z

ampli
Mar 22, 2021
Collaborator Author

I haven't tried the code yet but I tried to read some comments related to generation.

0 replies

linas · 2021-03-23T17:31:04Z

linas
Mar 23, 2021
Maintainer

During generation, the dictionary may contain disjuncts there are "impossible", in that they are never used (and can't be used, e.g. because they contain connectors that don't have a matching connector going in the other direction, or because of other reasons.) It would be useful to obtain a list of these disjuncts.

10 replies

ampli Mar 31, 2021
Collaborator Author

I just proposed the possibility:

Disjunct * dictionary_unused_disjuncts(Dictionary, double cost, Parse_options opts)

but it is not good since it doesn't provide access to the extrator_t data structure.
(It is also not good in that the provided cost (cost_max) and dialect should match those that were used in sentence_parse, else the result is meaningless.)

So only this seems fine:

Disjunct * sentence_unused_disjuncts(Sentence).

ampli Mar 31, 2021
Collaborator Author

I implemented the following:

Disjunct ** sentence_unused_disjuncts(Sentence); // NULL terminated.
char * disjunct_expression(Disjunct *); // To be free()'ed by the caller.
Category_cost *disjunct_categories(Disjunct *); // NULL terminated

In sentence_unused_disjuncts() build an unused-disjunct pointer array. Hence it needs to be freed by the caller (instead, with more efforts I could qsort() the internal wild-card-word disjunct array (terminated by an all-zeros disjunct) and provide a pointer to its base (so there would be no need to free it). But I guess that freeing is not a problem.
The disjunct_expression() API returns something like A- & B- & C+.
I changed the disjunct_categories() API to return the Category_cost array, just like linkage_get_categories() because it is already available. Thus no need to free its returned value.

I also added to link-generator printing of the number of unused disjuncts, it looks like:
# Number of unused disjuncts: 409
I wanted to make it something like:
# Number of unused disjuncts: 409/12345
but there is no API to get the total number of different disjuncts. (Using --verbosity=5 you can see a debug printout likefind_unused_disjuncts: Info: Unused disjuncts 409/12345.)

ampli Mar 31, 2021
Collaborator Author

Result example.
The invocation uses a new flag --unused-disjuncts to display the unused disjuncts.
For de it displays fewer and fewer disjuncts when with longer sentences, with a minimum starting at sentence length 13.
The printout includes an example word and tits disjunct string that is not used.
One of the questions is whether the format of the disjunct string (printed as a valid expression) is better than just using blanks between the connectors.

$ link-generator en -l de -s 13 -x -u
#
# Corpus for language: "de"
# Sentence length: 13
# Requested corpus size: 50
link-grammar: Info: Dictionary found at ../de/4.0.dict
link-grammar: Info: de: Spell checker disabled.
# Dictionary version 5.9.0
# Number of categories: 178
link-grammar: Warning: Combinatorial explosion! nulls=0 cnt=2147483647
Consider retrying the parse with the max allowed disjunct cost set lower.
At the command line, use !cost-max
# Linkages found: 2147483647
# Linkages generated: 50
# Number of unused disjuncts: 7
#
    0:                    .  Xp- & RW+
    1:                    .  Xc- & Xp- & RW+
    2:            LEFT-WALL  W+ & Xp+
    3:            LEFT-WALL  W+ & RW+
    4:           RIGHT-WALL  RW-
    5:          POSTPROCESS  BOGUS+
    6:             HMS-TIME  BZZT+

Note that the shown LEFT-WALL and RIGHT-WALL "unused" disjuncts are not correct, as disjuncts of fixed words are currently not considered (more code is needed for that, which is similar to the disjunct duplicate elimination).

ampli Mar 31, 2021
Collaborator Author

I finished the API of unused disjuncts.
In dict-api.h:

link_public_api(Disjunct **)
	sentence_unused_disjuncts(Sentence);

link_public_api(char *)
	disjunct_expression(Disjunct *);

link_public_api(const Category_cost *)
	disjunct_categories(Disjunct *);

sentence_unused_disjuncts() and disjunct_expression() needs freeing of their returned value.
Should we provide functions for that (that just use free())? Or is it enough to mention this as a comment in the API declarations?

BTW, I still haven't implemented unused disjuncts consideration for fixed words.

If this API looks fine, I will send a PR.

linas Apr 1, 2021
Maintainer

The API seems reasonable. Mark it as "experimental, subject to change". Using free() is fine, for now.

disjunct_expression() can return either A- & B- & C+ or just A- B- C+, I don't think I care right now.

Some aspects of this are confusing, so let me think out loud a bit. The request arises because I'm generating random dictionaries, and I suspect that those dictionaries include disjuncts that can never be used, no matter what the length of a sentence. But I don't know, and thought it would be interesting to find out. I don't even know if this is a tiny fraction of all, or whether most of them are unusable. To figure that out, what you propose seems adequate. To obtain a list of unused disjuncts for any length, the lists for fixed sentence lengths would need to be intersected; but this is easy enough with bash, perl, scheme.

For sentences with fixed words, its confusing. There are multiple pruning steps; should pruned disjuncts on fixed words count as "unused" or not? What about "unused" disjuncts on *-words? I don't know. .. Right now I have no need for a good answer for this situation.

ampli · 2021-03-27T13:31:15Z

ampli
Mar 27, 2021
Collaborator Author

I modified link-generator.c to restore the costs when printing the disjuncts.
However, it seems these costs are not so meaningful, since each sentence typically has many linkages, with possibly different costs.
This also makes it unmeaningful to compare sentence costs to decide which sentences are less probable.

However, with some effort, it is possible to restore (with library code) the full set of different possible costs for each sentence.
EDIT: It is enough to obtain the minimal cost. This is equivalent to the current ranking of linkages, which is accurate only if there are not "too many" linkages. I thought of a better way, but it is more than "some effort".

1 reply

ampli Apr 1, 2021
Collaborator Author

For sentences with fixed words, its confusing. There are multiple pruning steps; should pruned disjuncts on fixed words count as "unused" or not? What about "unused" disjuncts on *-words? I don't know. .. Right now I have no need for a good answer for this situation.

My algo for finding unused disjuncts compares the disjuncts of *-words in the parse results to those of a *-word as obtained from the dict (i.e. before pruning). The missing ones are reported as unused. The extension for fixed words is to consider them as used if they are used on some fixed word (for long sentences maybe these are only the walls).

It seems possible to use the list of unused disjuncts as feedback to prune the dictionary, to obtain a dictionary with less redundant disjuncts (or maybe even no redundant disjuncts).

linas · 2021-04-04T06:13:35Z

linas
Apr 4, 2021
Maintainer

Regarding this comment

It seems possible to use the list of unused disjuncts as feedback to prune the dictionary, to obtain a dictionary with less redundant disjuncts (or maybe even no redundant disjuncts).

A few years ago, I realized that I should think of dictionaries as having the shape of a spider or centipede, having a complex body, with legs on one side, and legs on another. Each leg on one side is a lists of words sharing a common expression. Each leg on the other side is a macro such as <CLAUSE> which is an expression that is used repeatedly in a number of places. (Here, I'm thinking of a macro as "just a list of disjuncts") The body is the densely tangled mess in between these two sides.

The act of creating a dictionary, either by hand, or automatically, is to define these legs, and the body. Formally, it resembles matrix factorization. Consider the matrix M of all (word,disjunct) pairs -- for the en dict, there are about 125K words, and maybe 20K(???) disjuncts so M is a sparse matrix of dimension 125K x 20K, having a 1 in the matrix entry if (word,disjunct) is in the dict, and otherwise has 0. (Its mostly all zero, of course - its very sparse.)

I claim M can be factored as M = L D R where L is a sparse matrix of dimension 125K x 2300, and D is a dense matrix of dimension 2300 x 510 and R is a sparse matrix of dimension 510 x 20K. (A dense matrix is one where most entries are not zero). I get the two magic numbers like so:

cat 4.0.dict |grep ";" |wc

tells me there are about 2300 different word-lists or "spider-legs" in the dict. Meanwhile:

cat 4.0.dict |grep ">:" |wc

tells me there are about 500 macro definitions in the dict. The "grammar" of the dict, the complexity, is D.

A key part of the language-learning project is to obtain this matrix factorization.

Your code already has the concept of "wordclasses" hard-coded into it. Wordclasses are exactly the L part of the factored matrix. These are the spider-legs on one side of the body. I claim that its possible (or that it should be possible) create/find/use/formalize "disjunct classes". Right now, there is no clean or clear-cut concept of "disjunct classes". Instead we have a tangle of Expressions and macros which encode the D R part of the matrix factorization.

One reason that I want to think of this as a matrix factorization problem is that there is a lot of interesting theory to draw on. There are some very traditional matrix factorization algos: these are used by google and amazon to factorize (customer, product) matrices. There are some more sophisticated algos, used for factoring (antigen,immunoglobulin) pairs (I kid you not - the paper I read did it for zebrafish, which have a simple immune system, and they contrasted several matrix factorization methods, including Ising models and Markov matrices and more complex combinatorial techniques). Last but not least, the matrix D is dense (and pretty small), which means that the whizzy deep-learning/neural-net algos should work very well for it ... if we can first factor away the L and R parts.

The above is easy to say (well, easy for me to say). It takes a little more work to understand (I have an 80 page paper that goes through the details, including a reference to the google, amazon and zebrafish papers). The very hardest part, however, is writing the code to make it happen, to actually do it.

Anyway: You now understand what the L part is -- its the word-classes. Think about how one might find the R part, and think about whether it is useful to find it or not. (It's useful for me, for the learning code, it simplifies that problem, reduces it's size.) It doesn't seem useful for parsing, but I haven't thought about that. I suppose that 10 or 20 years ago, it might have been useful for hand-curating dictionaries, to untangle the mess... but now .. ugh.

4 replies

ampli Apr 4, 2021
Collaborator Author

and maybe 20K(???) disjuncts

Did you mean 20M?
When I sum the disjuncts of all categories in the en dict (using the default cost-max) I get a total of ~19M.

Some comments:

Only ~3M of them are unique.
Some of them have jets with more than 6 connectors and those obviously cannot connect. Some others require connection before LEFT-WORD. The generation mode is not optimized yet to run on the en dict with ~12 wildcard words that are needed for a better estimation of the number of unusable disjuncts.
The "exceptions" that are currently handled by post-processing worry me regarding the possibility of such factorization without deliberately losing information (unless somehow these post-processing rules can be inferred too).

linas Apr 4, 2021
Maintainer

Did you mean 20M?

I guess I did! Wow!

Only ~3M of them are unique.

That means there are at most 3M disjunct classes (right?), and maybe a whole lot less.

post-processing

OK, so here's the deal. 10 years ago, when the en dict was smaller and simpler, post-processing was important. As the dict grew, part of the growth added more and more constraints, resulting in fewer and fewer post-processng rules triggering. Last time I looked, 3-4 years ago, almost half of the post-processing rules never-ever triggered, even after many hours of parsing. It might be safe to remove them.

For the rules that are still active, it might be possible to deactivate them by altering the dictionary to be more tightly constrained. This seemed to be possible in the few cases I looked at. Doing this is a chore - its time-consuming work. Testing to see if it made things better or worse is hard to do. So I'm happy to leave post-processing alone, for now.

Then, the meta-issues. The learning code doesn't have any way of learning post-processing rules. It could, I suppose, but should it? It's complicated.

The other meta-issue is testing. Just because LG produces a (low-cost) parse, that doesn't mean its the right parse. I don't know how to automate testing for this, except to manually review each and every parse, and decide if it is good or not. It's hard to bake this into a unit test, because changes to the dict can convert good parses into even-better parses, while the unit test would flag the even-better parse as an error. There is no obviously-good solution to this, and I am not convinced that it is even worth the effort to find a solution. If the ultimate goal is to learn dictionaries automatically, "goodness" has to be measured some other way.

20M

Here's a deeply flawed, non-sense estimate:

cat 4.0.dict |grep "or" |wc
   6948
cat 4.0.dict |grep "&" |wc
   5693

This suggests perhaps 7K x 5K = 35M disjuncts. So the right ballpark.

ampli Apr 4, 2021
Collaborator Author

That means there are at most 3M disjunct classes (right?), and maybe a whole lot less.

I'm not sure about what you mean in "X disjunct classes".

Here is the result of link-generate -l en -s 5 --verbosity=5:

#### Creating a wild-card word disjunct list
#### Finished tokenizing (3 tokens)
Trace: eliminate_duplicate_disjuncts: w1: Killed 3598714 duplicates
Trace: eliminate_duplicate_disjuncts: w1: Killed 12463039 duplicates (different word-strings)
Debug: Encode for pruning (len 3): tracon_id 114708 (38755+,75953-), shared connectors 25177125
++++ Finished creating list: 3096462 disjuncts   16.24 seconds
^C

The 3598714 duplicates are with the same word_string, i.e. duplicates when expanding the disjuncts of single categories.
The 12463039 duplicates are when ignoring the word_string (i.e. identical disjuncts in different categories).
The remaining ~3M disjuncts contain maybe 20% with "too many" connectors that cannot connect (I need to check that again), and some unknown fraction (I have not checked yet) of disjuncts that try to connect before LEFT-WALL. (In a usual parse they don't add overhead because they are pruned in a negligible time.)

Last time I looked, 3-4 years ago, almost half of the post-processing rules never-ever triggered, even after many hours of parsing. It might be safe to remove them.

Because I once started to rewrite the post-processing (and found it can be done several times faster), I looked at your check of which post-processing rules are not triggered, and it has a problem: it doesn't take into account that pprune() already use many of the rules, so they are not activated in the post-processing stage (it can be trivially fixed by counting the rules also in pprune()).

This seemed to be possible in the few cases I looked at.

I think that in general, it may be impossible since post-processing deals will link names, which you cannot generally restrict in the dict, and also they deal with link formats, like "must form a circle". Since link names are known in power_prune() and while parsing, I thought about ways to implement there part of the post-processing, but didn't find a way to do that.

In the "old times", while constructing a more advanced Hebrew dict (incomplete and unpublished work), I encountered a case in which it seemed that only post-processing is the reasonable solution. It was a case involving a general rule of combining several words, that had 2 exceptions of certain links that are forbidden. Because such exceptions cannot be described in the dict, trying to directly do it in the dict involved abandoning the general rules and inventing a lot of connectors just to implement a long list of private cases. I will try to simplify this problem to a simple rule plus a few words in order to demonstrate that (it may take a while).

Then, the meta-issues. The learning code doesn't have any way of learning post-processing rules. It could, I suppose, but should it? It's complicated.

I think it can bypass the need for post-processing by just generating a lot of private cases.

The other meta-issue is testing.

The test suite may test mainly simple sentences in which the best parse is "known" and less likely to change, and this only in order to check if the development process didn't introduce obvious bugs to the library or the dictionary. I guess this is more what it does now.

If the ultimate goal is to learn dictionaries automatically, "goodness" has to be measured some other way.

A partial way may be to just do an accept/reject test on a corpus in which it is already known which sentences are grammatical (or that we can suppose that absolutely most of the sentences int are all grammatical or all ungrammatical). This of course doesn't tell if the detailed parsing is correct.

(From here I continued to think about the possibility of automatic learning of punctuation or the roll of uppercase letters. This maybe belongs to the previous discussion ("next steps")).

linas Apr 5, 2021
Maintainer

disjunct classes

I do not currently have any good formal definition for these. It's more of an informal observation that they are there, but that the boundary between the D part and the R part is blurry. (The boundary between L and D is also blurry, but in practice the blur is removed by using subscripts on words.)

it doesn't take into account that pprune() already use many of the rules

Ooops! Well, that's a mistake.

I encountered a case in which it seemed that only post-processing is the reasonable solution

The use of lower case letters in link names can enforce long-distance constraints. How to do this is not obvious; it was a long time before I realized this could be done. Anyway, the learning code does not know how to use lower-case letters in link names, or how to impose constraints. That's a different project for some other day.

ampli · 2021-06-15T00:07:12Z

ampli
Jun 15, 2021
Collaborator Author

In https://groups.google.com/g/link-grammar/c/GWLVSVWaUDU/m/6Klgg3oyAQAJ @linas said:

p.s. Amir, we should talk, probably off-list, about whether we can steal any aspect of the above, and slot it into the existing LG code base. For example, during counting, to maybe limit choices, or something. There are ... subtleties ...

Indeed. After some thought, I still don't have an idea how to apply the "entropy collapse algorithm" using LG. One problem is that the LG constraints are not local at all. Another problem is that a local minimum doesn't imply a minimum sentence cost.

However:

There are some other subtle points. Suppose I want to generate a sentences with specific words in them, but I don't know the word order, yet. The words might be "Ben" "eats" and "pizza" but of course, one could generate: "the pizza was eaten by Ben" or "we saw Ben eating a pizza" or "he was eating a pizza, when we last saw Ben". Worse: we need lexical functions to provide the alternatives eats-eating-eaten. So there's lots of interesting stuff to explore.

I have some ideas regarding that, using Python libraries like inflect combined with LG.

None-the-less, the core idea, of assembling the jigsaw-puzzle pieces in such a way that one is working with the minimum-entropy locations first, is a fantastic idea. (Painfully obvious, in retrospect: when one assembles a real-life jigsaw puzzle, one does exactly this: one starts with edges, cause there aren't very many of them. Then one organizes by color, and does the ones with the smallest number of choices, first. And so on.)

LG currently does this when generating a sentence: The words at the edges are severely constrained by the walls, which limits their disjunct length, and then the power pruning does a good job there.

In any case, we may try to extend the LG library with an infrastructure that enables better generation.
For example, it is easy to add the following:
At each slot, a set of list of words can be supplied (each list can be ordered or unordered or have other constraints, and can even be empty- the library supports that using "optional word").

Just now a very limited form of this idea is already supported: A wildcard prefix. For example, try this:

$ link-generator -l en -s 0
This is a\* inter\* test.

By a proper API, this can be extended to a set of sentence fragments for each slot (this is what I referred to above by ordered list of words, but they can be marked as unordered if desired):
Slot 1: "He"
Slot 2: [empty] OR "went to" OR "bought from the" OR "is"
Slot 3: ...
Slot n: "shop."
(The current infrastructure supports that.)

On top of such (and more) experimental infrastructure, an experimental generator may be written using Python bindings (and several other external libraries).
The question is how not to clutter the library with too much experimental stuff... Doing it in another branch would not help, as each addition to the library would need to consider the needs of both branches (or even worst - several branches).

The idea is that the "classic" parser is extremely fast and as long as you are not going to assign an extremely big list of words in each slot (like now, using \*) it might be fast enough for such a generation method.

I already started to expand the library infrastructure by adding the ability to parse text fragments and words which may expand to a list of other words (unfinished).

0 replies

linas · 2022-03-21T19:35:27Z

Generation mode #1146

ampli Feb 27, 2021 Collaborator

Replies: 27 comments · 103 replies

ampli Feb 27, 2021 Collaborator Author

ampli Feb 27, 2021 Collaborator Author

ampli Feb 27, 2021 Collaborator Author

ampli Mar 10, 2021 Collaborator Author

linas Mar 10, 2021 Maintainer

ampli Mar 11, 2021 Collaborator Author

ampli Mar 11, 2021 Collaborator Author

ampli Mar 12, 2021 Collaborator Author

linas Mar 9, 2021 Maintainer

ampli Mar 11, 2021 Collaborator Author

ampli Mar 11, 2021 Collaborator Author

ampli Mar 11, 2021 Collaborator Author

ampli Mar 12, 2021 Collaborator Author

linas Mar 14, 2021 Maintainer

ampli Mar 10, 2021 Collaborator Author

ampli Mar 10, 2021 Collaborator Author

linas Mar 14, 2021 Maintainer

ampli Mar 14, 2021 Collaborator Author

ampli Mar 10, 2021 Collaborator Author

ampli Mar 20, 2021 Collaborator Author

linas Mar 20, 2021 Maintainer

linas Mar 20, 2021 Maintainer

ampli Mar 20, 2021 Collaborator Author

ampli Mar 20, 2021 Collaborator Author

linas Mar 14, 2021 Maintainer

ampli Mar 14, 2021 Collaborator Author

ampli Mar 14, 2021 Collaborator Author

linas Mar 14, 2021 Maintainer

ampli Mar 14, 2021 Collaborator Author

linas Mar 14, 2021 Maintainer

ampli Mar 14, 2021 Collaborator Author

ampli Mar 14, 2021 Collaborator Author

ampli Mar 14, 2021 Collaborator Author

ampli Mar 15, 2021 Collaborator Author

linas Mar 15, 2021 Maintainer

ampli Mar 15, 2021 Collaborator Author

linas Mar 15, 2021 Maintainer

ampli Mar 15, 2021 Collaborator Author

ampli Mar 17, 2021 Collaborator Author

linas Mar 15, 2021 Maintainer

ampli Mar 15, 2021 Collaborator Author

ampli Mar 15, 2021 Collaborator Author

ampli Mar 15, 2021 Collaborator Author

linas Mar 16, 2021 Maintainer

linas Mar 16, 2021 Maintainer

linas Mar 15, 2021 Maintainer

ampli Mar 15, 2021 Collaborator Author

linas Mar 15, 2021 Maintainer

ampli Mar 15, 2021 Collaborator Author

linas Mar 15, 2021 Maintainer

ampli Mar 16, 2021 Collaborator Author

linas Mar 17, 2021 Maintainer

linas Mar 17, 2021 Maintainer

linas Mar 17, 2021 Maintainer

linas Mar 17, 2021 Maintainer

ampli Mar 17, 2021 Collaborator Author

linas Mar 17, 2021 Maintainer

linas Mar 17, 2021 Maintainer

ampli Mar 17, 2021 Collaborator Author

ampli Mar 17, 2021 Collaborator Author

ampli
Feb 27, 2021
Collaborator

Replies: 27 comments 103 replies

ampli
Feb 27, 2021
Collaborator Author

ampli
Feb 27, 2021
Collaborator Author

ampli
Feb 27, 2021
Collaborator Author

ampli Mar 10, 2021
Collaborator Author

linas Mar 10, 2021
Maintainer

ampli Mar 11, 2021
Collaborator Author

ampli Mar 11, 2021
Collaborator Author

ampli Mar 12, 2021
Collaborator Author

linas
Mar 9, 2021
Maintainer

ampli Mar 11, 2021
Collaborator Author

ampli Mar 11, 2021
Collaborator Author

ampli Mar 11, 2021
Collaborator Author

ampli Mar 12, 2021
Collaborator Author

linas Mar 14, 2021
Maintainer

ampli
Mar 10, 2021
Collaborator Author

ampli Mar 10, 2021
Collaborator Author

linas Mar 14, 2021
Maintainer

ampli Mar 14, 2021
Collaborator Author

ampli
Mar 10, 2021
Collaborator Author

ampli Mar 20, 2021
Collaborator Author

linas Mar 20, 2021
Maintainer

linas Mar 20, 2021
Maintainer

ampli Mar 20, 2021
Collaborator Author

ampli Mar 20, 2021
Collaborator Author

linas
Mar 14, 2021
Maintainer

ampli Mar 14, 2021
Collaborator Author

ampli Mar 14, 2021
Collaborator Author

linas Mar 14, 2021
Maintainer

ampli Mar 14, 2021
Collaborator Author

linas Mar 14, 2021
Maintainer

ampli
Mar 14, 2021
Collaborator Author

ampli Mar 14, 2021
Collaborator Author

ampli Mar 14, 2021
Collaborator Author

ampli
Mar 15, 2021
Collaborator Author

linas Mar 15, 2021
Maintainer

ampli Mar 15, 2021
Collaborator Author

linas Mar 15, 2021
Maintainer

ampli Mar 15, 2021
Collaborator Author

ampli Mar 17, 2021
Collaborator Author

linas
Mar 15, 2021
Maintainer

ampli Mar 15, 2021
Collaborator Author

ampli Mar 15, 2021
Collaborator Author

ampli Mar 15, 2021
Collaborator Author

linas Mar 16, 2021
Maintainer

linas Mar 16, 2021
Maintainer

linas
Mar 15, 2021
Maintainer

ampli Mar 15, 2021
Collaborator Author

linas Mar 15, 2021
Maintainer

ampli Mar 15, 2021
Collaborator Author

linas Mar 15, 2021
Maintainer

ampli
Mar 16, 2021
Collaborator Author

linas Mar 17, 2021
Maintainer

linas Mar 17, 2021
Maintainer

linas
Mar 17, 2021
Maintainer

linas Mar 17, 2021
Maintainer

ampli
Mar 17, 2021
Collaborator Author

linas Mar 17, 2021
Maintainer

linas Mar 17, 2021
Maintainer

ampli Mar 17, 2021
Collaborator Author

ampli
Mar 17, 2021
Collaborator Author