Releases: GoogleCloudPlatform/dataflow-opinion-analysis
Release 0.6.6
- Added v1 of a Dictionary Builder pipeline to calculate n-gram stats
- Added an option to read from a Reddit archive file
- Using Sirocco 1.0.8
Release 0.6.5
Updated the models/RedditEngagement.ipynb notebook:
- Removed dependencies on gfile
- Added a mode current_run_in_colab=True for running the notebook in Colab or Kaggle
- Made the notebook more resilient to missing packages in public execution environments like Colab and Kaggle
- Added a mode current_read_from_bq=True that lets users decide whether to get their training data from BigQuery or from snapshot files
- Uploaded snapshot files to models/data so that users without BigQuery access can still run the model
- Made the notebook available on Kaggle: https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit/
Release 0.6.4
- Switched the sourceRecordFile option to use TextIO again, now that TextIO supports custom delimiters.
- Added the --recordDelimiters option to use with the new withDelimiter method of TextIO. It accepts a list of integers representing record delimiters. If this option is not specified, the default TextIO delimiters '\r', '\n', and "\r\n" are used. To import an entire file as one record, set this option to an ASCII character that you know does not appear in the input file, e.g. 30, the decimal code of the ASCII Record Separator. (A Beam Java sketch of this mechanism follows the examples below.)
- CSV import support: added the --readAsCSV option to use with --sourceRecordFile. The mandatory parameter for CSV inputs is --textColumnIdx, the 0-based index of the column in the input file where the text content resides. An optional parameter, --collectionItemIdIdx (e.g. --collectionItemIdIdx=0), is the column index of a unique ID that will be written into the output; it can then be used to join the indexed text with other records. Important: --recordDelimiters needs to be set to a character not present in the input so that the entire file is read at once.
Examples:
Base case: importing text files
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID \
--runner=DataflowRunner \
--maxNumWorkers=10 \
--workerMachineType=n1-standard-2 \
--stagingLocation=gs://$GCS_BUCKET/staging/ \
--tempLocation=gs://$GCS_BUCKET/temp/ \
--streaming=false \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--bigQueryDataset=opinions \
--writeTruncate=true \
--processedUrlHistorySec=130000 \
--wrSocialCountHistoryWindowSec=610000 \
--ratioEnrichWithCNLP=0 \
--sourceRecordFile=true \
--inputFile=gs://$GCS_BUCKET/input/*.txt \
--indexAsShorttext=false"
Importing text files, with the entire file treated as a single record:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true
--recordDelimiters=30
Importing text files, with each line (separated by a newline character) treated as a separate record:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true
Importing CSV files:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.csv
--readAsCSV=true
--textColumnIdx=$ZERO_BASED_COLUMNINDEX_IN_CSV_FILE
--collectionItemIdIdx=$OPTIONAL_UNIQUE_ID_COLUMNINDEX_IN_CSV_FILE
--recordDelimiters=30
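For reference, here is a minimal Beam Java sketch of how a custom record delimiter such as --recordDelimiters=30 maps onto TextIO's withDelimiter. The bucket path is a placeholder, and this is an illustration of the mechanism, not the IndexerPipeline's actual wiring:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DelimiterReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // --recordDelimiters=30 corresponds to the ASCII Record Separator (0x1E).
    // If that byte never occurs in the input, each matched file is read as
    // a single record instead of being split on '\r' and '\n'.
    PCollection<String> records = p.apply("ReadRecords",
        TextIO.read()
            .from("gs://my-bucket/input/*.txt") // placeholder path
            .withDelimiter(new byte[] {30}));

    p.run().waitUntilFinish();
  }
}
```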
Release 0.6.3
- Upgraded to Dataflow SDK version 2.2.0
Release 0.6.2
Java code:
- Fixed an issue in the FileIndexerPipeline where an extra newline was added to the output CSV file
Release 0.6.1: MetaFields and CSV import-export solution
BigQuery:
- Added the MetaFields field to the webresource table for source-specific metadata fields, accessible via SAFE_OFFSET (0-based index).
Java code:
- Added propagation of the MetaFields field from sources all the way to the BigQuery dataset
- Added the solutions package; its first solution, FileIndexerPipeline, takes a CSV file, indexes it, and writes out a CSV file
- Added a dependency on the Commons CSV package, which helps with CSV processing (a usage sketch follows below)
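As an illustration of what Commons CSV is used for here, a minimal sketch of pulling columns out of a record by 0-based index, mirroring the --collectionItemIdIdx and --textColumnIdx options introduced in release 0.6.4 above. The sample line and indexes are made up; this is not the FileIndexerPipeline's actual code:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvColumnSketch {
  public static void main(String[] args) throws IOException {
    String line = "42,Some document text to index";

    // Parse one CSV record and extract columns by 0-based index.
    try (CSVParser parser = CSVParser.parse(new StringReader(line), CSVFormat.DEFAULT)) {
      for (CSVRecord record : parser) {
        String itemId = record.get(0); // e.g. --collectionItemIdIdx=0
        String text = record.get(1);   // e.g. --textColumnIdx=1
        System.out.println(itemId + " -> " + text);
      }
    }
  }
}
```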
Release 0.5.1
Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning), and a refactored IndexerPipeline for easier understanding
BigQuery:
- Added partitioning to the raw fact tables (document, sentiment, webresource, wrsocialcount) and to the stats tables with daily snapshots (statstoryimpact, stattopic)
- reload_metadata_template.sh: added a script to populate the metadata "topic" table
- Added the topic table to store blocked topics
- Materialized many former views into tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment
Dataflow pipelines:
- Refactored IndexerPipeline so that the code is easier to read
- Added a Reshuffle transform: allows breaking up fused steps, e.g. when hitting OOMs
- Added a SplitAB transform: divides a PCollection into A and B branches by a defined ratio (a sketch follows this list)
- Added PartitionedTableRef: a helper for writing into partitioned BigQuery tables (also sketched after this list)
- Added a write to Bigtable to store the dead-letter queue (bad data)
- Added integration with Cloud NLP to extract entities and store them in BigQuery
- Added a tutorial package and the OpinionAnalysisPipeline class as the basis for a future tutorial
- StatsCalcPipeline: added statements to calculate the new stats tables backing the views
- config.properties: added an override file to control the sirocco-sa configuration
- custom-idioms-en.csv: added an override file for custom dictionaries
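The SplitAB transform can be pictured with a minimal Beam Java sketch. This is an illustrative stand-in under assumed names (SplitABSketch, BRANCH_A, BRANCH_B), not the repository's actual SplitAB code:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Hypothetical A/B splitter: routes roughly `ratio` of the elements to
// branch A and the rest to branch B. Not the repo's actual implementation.
public class SplitABSketch {
  public static final TupleTag<String> BRANCH_A = new TupleTag<String>() {};
  public static final TupleTag<String> BRANCH_B = new TupleTag<String>() {};

  public static PCollectionTuple splitAB(PCollection<String> input, final double ratio) {
    return input.apply("SplitAB", ParDo
        .of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (ThreadLocalRandom.current().nextDouble() < ratio) {
              c.output(c.element());            // main output -> branch A
            } else {
              c.output(BRANCH_B, c.element());  // additional output -> branch B
            }
          }
        })
        .withOutputTags(BRANCH_A, TupleTagList.of(BRANCH_B)));
  }
}
```

A caller would then retrieve the two branches with split.get(SplitABSketch.BRANCH_A) and split.get(SplitABSketch.BRANCH_B).

Likewise, a hedged sketch of the PartitionedTableRef idea, assuming Beam's dynamic-destination overload of BigQueryIO; the table spec and the PublicationDate field are placeholders, not the repo's actual helper:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Hypothetical helper: write each row into the daily partition named by its
// date field, using the "table$YYYYMMDD" partition decorator.
public class PartitionedWriteSketch {
  public static void writePartitioned(PCollection<TableRow> rows) {
    rows.apply("WriteToDailyPartition", BigQueryIO.writeTableRows()
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
            input -> {
              // e.g. "2017-10-21" -> "20171021"
              String day = ((String) input.getValue().get("PublicationDate"))
                  .replace("-", "");
              return new TableDestination(
                  "my-project:opinions.webresource$" + day, null);
            })
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
  }
}
```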
Support for Reddit, support for Dataflow SDK 2.0
Java pipelines:
- Switched from Dataflow SDK version 1.9.0 to Dataflow SDK version 2.0.0
- Started using the filesToStage parameter, pointing to a single bundled (shaded) jar containing all dependencies
- Modified the pom.xml file to converge the dependency versions. Added the maven-enforcer-plugin to enforce convergence.
- Migrated code to use the new 2.0 SDK
- Added support for Reddit BigQuery public dataset as an input source
BigQuery schema:
- Added the MainWebResourceHash and ParentWebResourceHash fields to the document, sentiment, and webresource tables to better support threaded conversations, e.g. in Reddit, HackerNews, email, or discussion boards
Scripts:
- Added the scripts/run_indexer_reddit_template.sh script to launch an IndexerPipeline parameterized to run a Reddit import.