Releases: GoogleCloudPlatform/dataflow-opinion-analysis
Release 0.6.6
- Added v1 of a Dictionary Builder pipeline to calculate n-gram stats
- Added an option to read from a Reddit archive file
- Using Sirocco 1.0.8
Release 0.6.5
Updated the models/RedditEngagement.ipynb notebook:
- Removed dependencies on gfile
- Added a mode current_run_in_colab=True for running the notebook in Colab or Kaggle
- Made the notebook more resilient to missing packages in public execution environments like Colab and Kaggle
- Added a mode current_read_from_bq=True that lets users decide whether to get their training data from BigQuery or from snapshot files
- Uploaded snapshot files to models/data so that users without BigQuery access can still run the model
- Made the notebook available on Kaggle: https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit/
Release 0.6.4
- Switched the sourceRecordFile option to use TextIO again, now that TextIO supports custom delimiters.
- Added the --recordDelimiters option to use with the new withDelimiter method of TextIO. It accepts a list of integers representing record delimiters. If this option is not specified, the default TextIO delimiters '\r', '\n', and "\r\n" are used. To import an entire file as one record, set this option to an ASCII character that you know does not appear in the input file, e.g. 30, the decimal code of the ASCII Record Separator. (A Beam Java sketch of this mechanism follows the examples below.)
- CSV import support: added the --readAsCSV option to use with --sourceRecordFile. The mandatory parameter for CSV inputs is --textColumnIdx, the 0-based index of the column in the input file where the text content resides. An optional parameter, --collectionItemIdIdx (e.g. --collectionItemIdIdx=0), is the column index of a unique ID that will be written into the output; it can then be used to join the indexed text with other records. Important: --recordDelimiters needs to be set to a character not present in the input so that the entire file is read at once.
Examples:
Base case: importing text files
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.opinionanalysis.IndexerPipeline \
-Dexec.args="--project=$PROJECT_ID \
--runner=DataflowRunner \
--maxNumWorkers=10 \
--workerMachineType=n1-standard-2 \
--stagingLocation=gs://$GCS_BUCKET/staging/ \
--tempLocation=gs://$GCS_BUCKET/temp/ \
--streaming=false \
--autoscalingAlgorithm=THROUGHPUT_BASED \
--bigQueryDataset=opinions \
--writeTruncate=true \
--processedUrlHistorySec=130000 \
--wrSocialCountHistoryWindowSec=610000 \
--ratioEnrichWithCNLP=0 \
--sourceRecordFile=true \
--inputFile=gs://$GCS_BUCKET/input/*.txt \
--indexAsShorttext=false"
Importing text files, with the entire file treated as a single record:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true
--recordDelimiters=30
Importing text files, with each line (separated by a newline character) treated as a separate record:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.txt
--readAsPropertyBag=true
Importing CSV files:
--sourceRecordFile=true
--inputFile=gs://$GCS_BUCKET/input/*.csv
--readAsCSV=true
--textColumnIdx=$ZERO_BASED_COLUMNINDEX_IN_CSV_FILE
--collectionItemIdIdx=$OPTIONAL_UNIQUE_ID_COLUMNINDEX_IN_CSV_FILE
--recordDelimiters=30
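For reference, here is a minimal Beam Java sketch of how a custom record delimiter such as --recordDelimiters=30 maps onto TextIO's withDelimiter. The bucket path is a placeholder, and this is an illustration of the mechanism, not the IndexerPipeline's actual wiring:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DelimiterReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // --recordDelimiters=30 corresponds to the ASCII Record Separator (0x1E).
    // If that byte never occurs in the input, each matched file is read as
    // a single record instead of being split on '\r' and '\n'.
    PCollection<String> records = p.apply("ReadRecords",
        TextIO.read()
            .from("gs://my-bucket/input/*.txt") // placeholder path
            .withDelimiter(new byte[] {30}));

    p.run().waitUntilFinish();
  }
}
```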
Release 0.6.3
- Upgraded to Dataflow SDK version 2.2.0
Release 0.6.2
Java code:
- Fixed an issue in the FileIndexerPipeline where an extra newline was added to the output CSV file
Release 0.6.1: MetaFields and CSV import-export solution
BigQuery:
- Added the MetaFields field to the webresource table for source-specific metadata fields, accessible via SAFE_OFFSET (0-based index).
Java code:
- Added propagation of the MetaFields field from sources all the way to the BigQuery dataset
- Added the solutions package; its first solution, FileIndexerPipeline, takes a CSV file, indexes it, and writes out a CSV file
- Added a dependency on the Commons CSV package, which helps with CSV processing (a usage sketch follows below)
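As an illustration of what Commons CSV is used for here, a minimal sketch of pulling columns out of a record by 0-based index, mirroring the --collectionItemIdIdx and --textColumnIdx options introduced in release 0.6.4 above. The sample line and indexes are made up; this is not the FileIndexerPipeline's actual code:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvColumnSketch {
  public static void main(String[] args) throws IOException {
    String line = "42,Some document text to index";

    // Parse one CSV record and extract columns by 0-based index.
    try (CSVParser parser = CSVParser.parse(new StringReader(line), CSVFormat.DEFAULT)) {
      for (CSVRecord record : parser) {
        String itemId = record.get(0); // e.g. --collectionItemIdIdx=0
        String text = record.get(1);   // e.g. --textColumnIdx=1
        System.out.println(itemId + " -> " + text);
      }
    }
  }
}
```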
Release 0.5.1
Major changes: performance improvements in the BigQuery dataset (including materialized views and partitioning), and a refactored IndexerPipeline for easier understanding
BigQuery:
- Added partitioning to the raw fact tables (document, sentiment, webresource, wrsocialcount) and to the stats tables with daily snapshots (statstoryimpact, stattopic)
- reload_metadata_template.sh: added a script to populate the metadata "topic" table
- Added the topic table to store blocked topics
- Materialized many former views into tables for faster querying: statdomainopinions, statstoryrank, stattopstory7d, stattoptopic7d, stattoptopic7dsentiment
Dataflow pipelines:
- Refactored IndexerPipeline so that the code is easier to read
- Added a Reshuffle transform: allows breaking up fused steps, e.g. when hitting OOMs
- Added a SplitAB transform: divides a PCollection into A and B branches by a defined ratio (a sketch follows this list)
- Added PartitionedTableRef: a helper for writing into partitioned BigQuery tables (also sketched after this list)
- Added a write to Bigtable to store the dead-letter queue (bad data)
- Added integration with Cloud NLP to extract entities and store them in BigQuery
- Added a tutorial package and the OpinionAnalysisPipeline class as the basis for a future tutorial
- StatsCalcPipeline: added statements to calculate the new stats tables backing the views
- config.properties: added an override file to control the sirocco-sa configuration
- custom-idioms-en.csv: added an override file for custom dictionaries
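The SplitAB transform can be pictured with a minimal Beam Java sketch. This is an illustrative stand-in under assumed names (SplitABSketch, BRANCH_A, BRANCH_B), not the repository's actual SplitAB code:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Hypothetical A/B splitter: routes roughly `ratio` of the elements to
// branch A and the rest to branch B. Not the repo's actual implementation.
public class SplitABSketch {
  public static final TupleTag<String> BRANCH_A = new TupleTag<String>() {};
  public static final TupleTag<String> BRANCH_B = new TupleTag<String>() {};

  public static PCollectionTuple splitAB(PCollection<String> input, final double ratio) {
    return input.apply("SplitAB", ParDo
        .of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (ThreadLocalRandom.current().nextDouble() < ratio) {
              c.output(c.element());            // main output -> branch A
            } else {
              c.output(BRANCH_B, c.element());  // additional output -> branch B
            }
          }
        })
        .withOutputTags(BRANCH_A, TupleTagList.of(BRANCH_B)));
  }
}
```

A caller would then retrieve the two branches with split.get(SplitABSketch.BRANCH_A) and split.get(SplitABSketch.BRANCH_B).

Likewise, a hedged sketch of the PartitionedTableRef idea, assuming Beam's dynamic-destination overload of BigQueryIO; the table spec and the PublicationDate field are placeholders, not the repo's actual helper:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Hypothetical helper: write each row into the daily partition named by its
// date field, using the "table$YYYYMMDD" partition decorator.
public class PartitionedWriteSketch {
  public static void writePartitioned(PCollection<TableRow> rows) {
    rows.apply("WriteToDailyPartition", BigQueryIO.writeTableRows()
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
            input -> {
              // e.g. "2017-10-21" -> "20171021"
              String day = ((String) input.getValue().get("PublicationDate"))
                  .replace("-", "");
              return new TableDestination(
                  "my-project:opinions.webresource$" + day, null);
            })
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
  }
}
```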
Support for Reddit, support for Dataflow SDK 2.0
Java pipelines:
- Switched from Dataflow SDK version 1.9.0 to Dataflow SDK version 2.0.0
- Started using the filesToStage parameter, pointing to a single bundled (shaded) jar containing all dependencies
- Modified the pom.xml file to converge the dependency versions. Added the maven-enforcer-plugin to enforce convergence.
- Migrated code to use the new 2.0 SDK
- Added support for Reddit BigQuery public dataset as an input source
BigQuery schema:
- Added the MainWebResourceHash and ParentWebResourceHash fields to the document, sentiment, and webresource tables to better support threaded conversations, e.g. in Reddit, HackerNews, email, or discussion boards
Scripts:
- Added the scripts/run_indexer_reddit_template.sh script to launch an IndexerPipeline parameterized to run a Reddit import.