Releases: GoogleCloudPlatform/dataflow-opinion-analysis

Release 0.6.6

08 Feb 00:07

  • Added v1 of the Dictionary Builder pipeline to calculate ngram stats
  • Added an option to read from a Reddit archive file
  • Now using Sirocco 1.0.8

Release 0.6.5

08 May 15:12

Updated the models/RedditEngagement.ipynb notebook:

  • Removed dependencies on gfile
  • Added a current_run_in_colab=True mode for running the notebook in Colab or Kaggle
  • Made the notebook more resilient to missing packages in public execution environments like Colab and Kaggle
  • Added a current_read_from_bq=True mode that lets users decide whether to get their training data from BigQuery or from snapshot files
  • Uploaded snapshot files to models/data so that users without BigQuery access can still run the model
  • Made the notebook available on Kaggle: https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit/

Release 0.6.4

12 Feb 02:57

  • Switched the sourceRecordFile option to use TextIO again, now that TextIO supports custom delimiters.
  • Added the --recordDelimiters option to use with the new withDelimiter option of TextIO. Accepts a list of integers representing record delimiters. If this option is not specified, the default TextIO delimiters '\r', '\n', and "\r\n" are used. To import an entire file as one record, set this option to an ASCII character that you know does not appear in the input file, e.g. 30, the decimal representation of the ASCII Record Separator.
  • CSV import support: Added the --readAsCSV option to use with --sourceRecordFile. The mandatory parameter for CSV inputs is --textColumnIdx, a 0-based column index specifying the column in the input file where the text content resides. The optional parameter --collectionItemIdIdx=0 is the column index of a unique ID that will be written into the output; collectionItemIdIdx can then be used to join the indexed text with other records. Important: the --recordDelimiters parameter needs to be set to a character not present in the input so that each entire file is read as one record.

Examples:

Base case: importing text files

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID
--runner=DataflowRunner
--maxNumWorkers=10
--workerMachineType=n1-standard-2
--stagingLocation=gs://$GCS_BUCKET/staging/
--tempLocation=gs://$GCS_BUCKET/temp/
--streaming=false
--autoscalingAlgorithm=THROUGHPUT_BASED
--bigQueryDataset=opinions
--writeTruncate=true
--processedUrlHistorySec=130000
--wrSocialCountHistoryWindowSec=610000
--ratioEnrichWithCNLP=0
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--indexAsShorttext=false
"

Importing text files, with the entire file treated as a single record:

--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true
--recordDelimiters=30
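
As a hedged illustration (not the actual IndexerPipeline code), the following minimal Beam sketch shows what --recordDelimiters maps to: TextIO can be given a custom delimiter, and a byte that never occurs in the input makes each file come out as one record. The bucket path below is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DelimiterReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Split records on the ASCII Record Separator (decimal 30) instead of the
    // default '\r', '\n', and "\r\n". If byte 30 never appears in the input,
    // each file is emitted as a single record.
    PCollection<String> records = p.apply(
        TextIO.read()
            .from("gs://my-bucket/input/*.txt") // placeholder path
            .withDelimiter(new byte[] {30}));
    p.run().waitUntilFinish();
  }
}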

Importing text files, with each line (separated by a newline character) treated as a separate record:

--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true

Importing CSV files:

--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.csv
--readAsCSV=true
--textColumnIdx=$ZERO_BASED_COLUMNINDEX_IN_CSV_FILE
--collectionItemIdIdx=$OPTIONAL_UNIQUE_ID_COLUMNINDEX_IN_CSV_FILE
--recordDelimiters=30
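
For CSV inputs, here is a rough sketch of the column extraction these options imply, using the Commons CSV dependency added in release 0.6.1 (the sample data and variable names are illustrative, not the FileIndexerPipeline internals):

import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvColumnSketch {
  public static void main(String[] args) throws IOException {
    String fileContents = "42,Some document text to index\n43,Another document";
    int textColumnIdx = 1;        // --textColumnIdx (0-based)
    int collectionItemIdIdx = 0;  // --collectionItemIdIdx (optional)

    try (CSVParser parser = CSVFormat.DEFAULT.parse(new StringReader(fileContents))) {
      for (CSVRecord record : parser) {
        String id = record.get(collectionItemIdIdx); // unique ID to carry into the output
        String text = record.get(textColumnIdx);     // text content to index
        System.out.println(id + " -> " + text);
      }
    }
  }
}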

Release 0.6.3

06 Feb 21:16

  • Upgraded to Dataflow SDK version 2.2.0

Release 0.6.2

14 Dec 08:01

Java code:

  • Fixed an issue in FileIndexerPipeline where an extra newline was added to the output CSV file

Release 0.6.1: MetaFields and CSV import-export solution

13 Dec 05:40

BigQuery:

  • Added the MetaFields field to the webresource table for source-specific metadata fields, accessible via SAFE_OFFSET (0-based index); see the sketch below.
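
For example, pulling the first metadata field out of the array might look like the sketch below. The query is the point; the BigQuery Java client wrapping is illustrative, and the WebResourceHash column name is an assumption.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class MetaFieldsQuerySketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // SAFE_OFFSET(0) returns the first MetaFields element, or NULL when the
    // array is empty, instead of erroring out like OFFSET(0) would.
    String query =
        "SELECT WebResourceHash, MetaFields[SAFE_OFFSET(0)] AS firstMetaField "
            + "FROM `opinions.webresource` LIMIT 10"; // assumed dataset.table
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(query).build());
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}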

Java code:

  • Added propagation of the MetaFields field from sources all the way to the BigQuery dataset
  • Added the solutions package; its first solution, FileIndexerPipeline, takes a CSV file, indexes it, and writes out a CSV file
  • Added a dependency on the Commons CSV package, which helps with CSV processing

Release 0.5.1

05 Nov 21:55

Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning), and a refactored IndexerPipeline for easier understanding

BigQuery:

  • Added partitioning to the raw fact tables document, sentiment, webresource, wrsocialcount, and to the stats tables that have daily snapshots: statstoryimpact, stattopic
  • reload_metadata_template.sh: added a script to populate the metadata "topic" table
  • Added the topic table to store blocked topics
  • Materialized many former views as tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment

Dataflow pipelines:

  • Refactored IndexerPipeline so that the code is easier to read
  • Added the Reshuffle transform, which allows breaking up fused steps, e.g. when hitting OOMs (see the sketch after this list)
  • Added the SplitAB transform, which divides a PCollection into A and B branches by a defined ratio (see the sketch after this list)
  • Added PartitionedTableRef, a helper for writing into partitioned BigQuery tables
  • Added a write to Bigtable to store the dead-letter queue (bad data)
  • Added integration with CloudNLP to get entities and store them in BigQuery
  • Added the tutorial package and the OpinionAnalysisPipeline class as the basis for a future tutorial
  • StatsCalcPipeline: added calculation statements for the new stats tables backing the views
  • config.properties: added an override file to control the sirocco-sa configuration
  • custom-idioms-en.csv: added an override file for custom dictionaries
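
A minimal sketch of the two transforms called out above, written against stock Beam primitives (Reshuffle.viaRandomKey and Partition); the repo's own implementations may differ, and the class and method names here are illustrative:

import java.util.Random;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class TransformSketches {

  // Materializes and redistributes elements, preventing the runner from fusing
  // the surrounding steps, e.g. when a fused stage runs into OOMs.
  public static PCollection<String> breakFusion(PCollection<String> input) {
    return input.apply(Reshuffle.viaRandomKey());
  }

  // Splits input into an A branch (index 0, roughly ratioA of elements)
  // and a B branch (index 1, the remainder).
  public static PCollectionList<String> splitAB(
      PCollection<String> input, final double ratioA) {
    return input.apply(Partition.of(2, new Partition.PartitionFn<String>() {
      private final Random random = new Random();

      @Override
      public int partitionFor(String elem, int numPartitions) {
        return random.nextDouble() < ratioA ? 0 : 1;
      }
    }));
  }
}

The returned PCollectionList then yields the A branch via get(0) and the B branch via get(1).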

Support for Reddit, support for Dataflow SDK 2.0

20 Jul 17:44

Java pipelines:

  • Switched from Dataflow SDK version 1.9.0 to Dataflow SDK version 2.0.0
  • Started using the filesToStage parameter, pointing to a single bundled (shaded) jar containing all dependencies
  • Modified the pom.xml file to converge dependency versions; added the maven-enforcer-plugin Maven plugin to enforce convergence
  • Migrated code to use the new 2.0 SDK
  • Added support for Reddit BigQuery public dataset as an input source

BigQuery schema:

  • Added the MainWebResourceHash and ParentWebResourceHash fields to the document, sentiment, and webresource tables to better support threaded conversations, e.g. in Reddit, Hacker News, email, or discussion boards (see the sketch below)
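
A hedged sketch of how the new fields can stitch replies to their parents: each child row's ParentWebResourceHash points at the MainWebResourceHash of the item it responds to, so one self-join reconstructs a conversation level (the Url column name is an assumption):

public class ThreadJoinSketch {
  // Self-join the webresource table: child rows reference their parent via
  // ParentWebResourceHash = parent.MainWebResourceHash.
  static final String THREAD_QUERY =
      "SELECT parent.Url AS parentUrl, child.Url AS childUrl "
          + "FROM `opinions.webresource` child "
          + "JOIN `opinions.webresource` parent "
          + "ON child.ParentWebResourceHash = parent.MainWebResourceHash";

  public static void main(String[] args) {
    System.out.println(THREAD_QUERY);
  }
}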

Scripts:

  • Added the scripts/run_indexer_reddit_template.sh script to launch an IndexerPipeline parameterized to run a Reddit import.