How can we maintain it? I would also like it to be based on the master branch, so that all future library improvements reach it, and so that we will eventually be able to merge it into master without conflicts. This means every PR to the master branch should also be applied to the "generate" branch (even if it is totally unrelated, like one applied to …
---
Originally, I intended to trigger the "generating" mode using `-test=generate`. However, I then noted that in the PR for the improved …

Suggestion for a less-hacky API for that: …

I propose to use option (1) for now.
---
Suggestion for a better API: …

I propose to use option (1). It is also somewhat more flexible in case you would like to generate sentences with certain words, e.g. only sentences ending with a period.
---
Continuing discussion from pull req #1143
Oh, OK. Interesting. One is enough, but it should not be a "random sampling"; it should really output a category label. This allows some later stage to pick the right word from that category: e.g. by uniform random sampling, Zipfian random sampling, or by using some external info to select a specific "synonym" in that category. The tricky question is "how to print that category label". I can avoid this question in the short term, by using dictionaries that have only one unique word per category.
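To make the later word-picking stage concrete, here is a minimal C sketch of Zipfian (vs. uniform) sampling from a category's word list. The word list is a stand-in; nothing here is existing LG API:

```c
#include <stdlib.h>

/* Pick an index into n words with Zipfian weights: P(i) ~ 1/(i+1).
 * (Uniform sampling would simply be rand() % n.) */
static int zipf_pick(int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) total += 1.0 / (i + 1);

    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < n; i++)
    {
        r -= 1.0 / (i + 1);
        if (r <= 0.0) return i;
    }
    return n - 1;
}

/* Given the members of one category, return the chosen word. */
static const char *pick_from_category(const char **words, int n)
{
    return words[zipf_pick(n)];
}
```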
I'll move the "generate" branch over to this repo. Then I think you can do pull reqs on it (I guess; I've never done that before). Another possibility is that you can pull my …
---
First thing, it needs to be rebased on …
---
I said: …

However, it turned out it is not that simple, since the disjunct-eliminating code (in the …
---
Dictionaries stored in sqlite3 don't seem to work. The …
---
So I guess I can forget it for now.
---
(Starting a new thread for that topic.)
A word can indeed appear in several categories, especially with different subscripts.
---
Bug: it seems that …
---
For each category, it seems you store the word, and not the subscripted word. Why? Does it matter? Is something easier if done one way rather than another?
---
@linas, …
---
Bug: With the SQL-backed dict, I am seeing sentences with two or three left-walls in them, even though left-wall has no left-facing connectors ...
---
You should not put them in the categories data structure. (For the dict file, they also don't include macros/regex labels and limit-directives.)
---
If you look at the disjuncts, you will see that these sentences still have a full linkage.
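For reference, the disjuncts of a linkage can be inspected through the existing C API; a minimal sketch, assuming a `Linkage` obtained from `linkage_create()`:

```c
#include <stdio.h>
#include <link-grammar/link-includes.h>

/* Print the disjuncts actually used by a linkage. */
static void show_disjuncts(Linkage linkage)
{
    char *str = linkage_print_disjuncts(linkage);
    printf("%s\n", str);
    linkage_free_disjuncts(str);
}
```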
---
Amir, you wrote: …

but I can't find these remarks in the threads above. So:

My knee-jerk reaction is to say "hurrah!" because this sounds like it could be very useful. But then again, is it actually useful? Is it really needed? It sounds like something one might want to be able to do, but ... why? Life is short, and this set of problems is complex enough that I don't want to spend energy designing something that is not really needed.
---
No problem.
---
I finished implementing the API we talked about, and updated …

Since it is an array of …, I still need to implement diagram printing in generation mode (printing disjunct-strings for category disjuncts).

I noted that the category words of …

It seems we forgot the need to set the chosen disjunct …

My next step is to move …
---
But it is not clear it is needed, as the API user can use my proposed …
---
I proposed: …

The name is not good, as it is supposed to set the word and the cost in the linkage.

```c
struct Linkage_s
{
    WordIdx num_words;    /* Number of (tokenized) words */
    bool is_sent_long;    /* num_words >= twopass_length */
    const char * * word;  /* Array of word spellings */
    float * cost;         /* Array of disjunct costs */
    ...
};
```

And the API will then be:

```c
int linkage_set_word(Linkage linkage, WordIdx w, char *word);
int linkage_set_cost(Linkage linkage, WordIdx w, double cost);
```
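A hedged sketch of how a generator client might use these proposed setters, assuming a linkage whose "words" are category placeholders; `pick_word_for_category()` is a hypothetical helper, while `linkage_get_num_words()` and `linkage_get_word()` are existing LG API:

```c
/* Replace each category placeholder in a generated linkage with a
 * concrete word, using the proposed setters. */
static void fill_linkage(Linkage linkage)
{
    size_t nwords = linkage_get_num_words(linkage);  /* existing LG API */
    for (WordIdx w = 0; w < nwords; w++)
    {
        const char *category = linkage_get_word(linkage, w);  /* existing LG API */
        char *word = pick_word_for_category(category);  /* hypothetical helper */
        linkage_set_word(linkage, w, word);  /* proposed API */
        linkage_set_cost(linkage, w, 0.0);   /* proposed API */
    }
}
```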
---
I notice you were reading commits in 'learn'; have you tried running the code?
---
I haven't tried the code yet, but I tried to read some comments related to generation.
---
During generation, the dictionary may contain disjuncts that are "impossible", in that they are never used (and can't be used, e.g. because they contain connectors that don't have a matching connector going in the other direction, or for other reasons). It would be useful to obtain a list of these disjuncts.
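To make "impossible" concrete, here is a rough C sketch of one way such a scan could work: a connector is hopeless if no connector with the same name ever appears with the opposite direction anywhere in the dictionary. The `Conn` structure is a simplified stand-in, not the library's internal one, and real LG matching (prefix match with lower-case subscripts) is omitted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a connector: name plus direction. */
typedef struct { const char *name; char dir; /* '+' or '-' */ } Conn;

/* A connector can never link if no same-named connector points the
 * other way anywhere in the dictionary; every disjunct containing
 * it is then "impossible". */
static bool is_impossible(const Conn *c, const Conn *all, size_t n_all)
{
    for (size_t i = 0; i < n_all; i++)
    {
        if (all[i].dir != c->dir && strcmp(all[i].name, c->name) == 0)
            return false;  /* a potential match exists */
    }
    return true;
}
```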
---
I modified … However, with some effort, it is possible to restore (with library code) the full set of different possible costs for each sentence.
---
Regarding this comment: …

A few years ago, I realized that I should think of dictionaries as having the shape of a spider or centipede: a complex body, with legs on one side and legs on the other. Each leg on one side is a list of words sharing a common expression. Each leg on the other side is a macro such as … The act of creating a dictionary, either by hand or automatically, is to define these legs and the body. Formally, it resembles matrix factorization. Consider the matrix …

I claim …

… tells me there are about 2300 different word-lists or "spider-legs" in the dict. Meanwhile,

… tells me there are about 500 macro definitions in the dict. The "grammar" of the dict, the complexity, is …

A key part of the language-learning project is to obtain this matrix factorization. Your code already has the concept of "wordclasses" hard-coded into it. Wordclasses are exactly the …

One reason that I want to think of this as a matrix factorization problem is that there is a lot of interesting theory to draw on. There are some very traditional matrix factorization algos: these are used by Google and Amazon to factorize (customer, product) matrices. There are some more sophisticated algos, used for factoring (antigen, immunoglobulin) pairs (I kid you not; the paper I read did it for zebrafish, which have a simple immune system, and it contrasted several matrix factorization methods, including Ising models and Markov matrices and more complex combinatorial techniques). Last but not least, the matrix …

The above is easy to say (well, easy for me to say). It takes a little more work to understand (I have an 80-page paper that goes through the details, including references to the Google, Amazon and zebrafish papers). The very hardest part, however, is writing the code to make it happen, to actually do it.

Anyway: you now understand what the …
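A hedged formalization of the spider picture above, in notation I am supplying (the original inline formulas were lost from this thread): let $D$ be the boolean word-by-expression matrix; the word-lists and macros then act as the two factors.

```latex
% D(w,e) = 1 iff word w carries expression e.
% Factor D through a set C of word classes ("spider legs"):
%   L(w,c) = 1 iff word w belongs to class c       (~2300 word-lists)
%   R(c,e) = 1 iff class c uses expression/macro e (~500 macros)
\[
  D \approx L\,R, \qquad
  D(w,e) \;=\; \bigvee_{c \in C} L(w,c) \wedge R(c,e)
\]
```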
---
In https://groups.google.com/g/link-grammar/c/GWLVSVWaUDU/m/6Klgg3oyAQAJ @linas said: …

Indeed. After some thought, I still have no idea how to apply the "entropy collapse algorithm" using LG. One problem is that the LG constraints are not local at all. Another problem is that a local minimum doesn't imply a minimum sentence cost. However: …

I have some ideas regarding that, using Python libraries like inflect combined with LG.

LG currently does this when generating a sentence: the words at the edges are severely constrained by the walls, which limits their disjunct length, and then the power pruning does a good job there. In any case, we may try to extend the LG library with an infrastructure that enables better generation.

Right now, a very limited form of this idea is already supported: a wildcard prefix. For example, try this: …

I.e. each slot can use the current wildcard-expansion infrastructure. This can trivially be extended to using a regex (which would even speed up generation considerably, due to the added word constraints): …

With a proper API, this can be extended to a set of sentence fragments for each slot (this is what I referred to above by an ordered list of words, though they can be marked as unordered if desired): …

On top of such (and more) experimental infrastructure, an experimental generator may be written using the Python bindings (and several other external libraries); a rough sketch of such a loop is given after this comment. The idea is that the "classic" parser is extremely fast, as long as you are not going to assign an extremely big list of words to each slot (like now, using …).

I already started to expand the library infrastructure by adding the ability to parse text fragments, and words which may expand to a list of other words (unfinished).
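To illustrate the "external generator on top of the classic parser" idea, here is a rough C sketch: the template sentence and candidate words are made up, while all library calls shown do exist in the LG C API.

```c
/* Try each candidate word in a template slot; keep the fillers for
 * which the parser finds a full linkage. */
#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    const char *candidates[] = { "dog", "idea", "ran" };
    Dictionary dict = dictionary_create_lang("en");
    Parse_Options opts = parse_options_create();

    for (size_t i = 0; i < sizeof(candidates)/sizeof(candidates[0]); i++)
    {
        char text[128];
        snprintf(text, sizeof(text), "the %s chased a cat", candidates[i]);

        Sentence sent = sentence_create(text, dict);
        sentence_split(sent, opts);
        if (sentence_parse(sent, opts) > 0)
            printf("OK:   %s\n", text);  /* full linkage found */
        else
            printf("FAIL: %s\n", text);
        sentence_delete(sent);
    }

    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}
```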
---
See also the discussion in #1290.
---
Previous discussion on that at these places: …