GitHub - ai-ku/glookup: glookup - reads ngram patterns with wildcards from stdin and prints their counts from the Web1T Google ngram data.

Repository contents (121 commits):

      .gitignore, ChangeLog, Makefile, README, dlib.c, dlib.h, glookup.1,
      glookup.c, glookup.pl, gngram.pl, gtokenize.pl, model.pl, model.txt,
      submatch.pl

GLOOKUP  	            Copyright (c) 2008-2014, Deniz Yuret

This is the code used in:

Deniz Yuret. 2008. Smoothing a Tera-word Language Model. In the 46th
Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies.  See http://goo.gl/rmD87d for details.

The glookup program reads ngram patterns with wildcards (represented
with the '_' character) from stdin and prints their counts from the
Web1T Google ngram data (whose path is given by the -p option).
Please see glookup.1 (man page), or glookup.txt (plain text format)
for documentation.
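
As a rough illustration of the wildcard idea, the sketch below checks
whether a space-separated ngram matches a pattern.  It is not taken from
glookup.c: the function name pattern_matches is hypothetical, and it
assumes that each '_' stands for exactly one token.  The man page is the
authority on the actual pattern semantics.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `ngram` matches `pattern` token by token.
 * Tokens are separated by single spaces.
 * Assumption (not confirmed by the README): a pattern token that is
 * exactly "_" matches any one token; all other tokens must be equal. */
bool pattern_matches(const char *pattern, const char *ngram) {
    assert(pattern != NULL && ngram != NULL);
    for (;;) {
        size_t plen = strcspn(pattern, " ");   /* length of pattern token */
        size_t nlen = strcspn(ngram, " ");     /* length of ngram token */
        bool wildcard = (plen == 1 && pattern[0] == '_');

        if (wildcard) {
            if (nlen == 0)                     /* wildcard needs a token */
                return false;
        } else if (plen != nlen || memcmp(pattern, ngram, plen) != 0) {
            return false;                      /* literal tokens differ */
        }

        pattern += plen;
        ngram += nlen;

        /* Match only if both strings run out of tokens together. */
        if (*pattern == '\0' || *ngram == '\0')
            return *pattern == '\0' && *ngram == '\0';

        pattern++;                             /* skip the separating space */
        ngram++;
    }
}
```

Under these assumptions, pattern_matches("a _ c", "a b c") is true, while
a pattern and an ngram of different lengths never match.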

The model.pl script optimizes and tests various language models.  See
'perldoc model.pl', or model.txt for documentation.  Typical usage:

      model.pl -patterns < text > patterns
      glookup -p web1t_path < patterns > counts
      model.pl -counts counts < text

The glookup.pl script quickly searches for a given pattern in
uncompressed Google Web1T data.  Use the C version for bulk processing
and the Perl version to get a few counts quickly.
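
To make the lookup concrete, here is a naive linear-scan sketch over
uncompressed Web1T lines, which have the form "w1 w2 ... wn<TAB>count".
It is not how glookup.pl or glookup.c work.  The names count_if_match
and sum_matching are hypothetical, and the sketch assumes that '_'
matches exactly one token and that the result for a pattern is the sum
of the counts of all matching ngrams.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parses one "w1 w2 ... wn<TAB>count" line.  Returns the count if the
 * ngram part matches `pattern`, otherwise 0.
 * Assumption: a pattern token "_" matches any single token. */
long long count_if_match(const char *pattern, const char *line) {
    assert(pattern != NULL && line != NULL);
    const char *tab = strchr(line, '\t');
    if (tab == NULL)
        return 0;                              /* malformed line: skip */

    const char *p = pattern, *n = line;
    for (;;) {
        size_t plen = strcspn(p, " ");         /* pattern token length */
        size_t nlen = strcspn(n, " \t");       /* ngram token length */
        bool wildcard = (plen == 1 && p[0] == '_');

        if (wildcard ? nlen == 0
                     : (plen != nlen || memcmp(p, n, plen) != 0))
            return 0;

        p += plen;
        n += nlen;
        if (*p == '\0' || *n == '\t')
            break;                             /* one side is out of tokens */
        p++;                                   /* skip separating spaces */
        n++;
    }
    if (*p != '\0' || *n != '\t')
        return 0;                              /* different token counts */
    return strtoll(tab + 1, NULL, 10);         /* stops at the newline */
}

/* Sums the counts of all lines in `in` that match `pattern`. */
long long sum_matching(const char *pattern, FILE *in) {
    char line[4096];
    long long total = 0;
    while (fgets(line, sizeof line, in) != NULL)
        total += count_if_match(pattern, line);
    return total;
}
```

A caller would open one uncompressed ngram file with fopen and pass the
stream to sum_matching.  A full scan like this is slow on data of this
size, so it serves only to show the file format and the matching rule.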