Use server for concurrency of 1 when writing analysis output & readability factors #54

Merged · 11 commits · Mar 24, 2024
src/aggregate_analysis.ts (1 addition, 1 deletion)
@@ -4,7 +4,7 @@ import { UASymbol } from '../cjsmin/src/chess';
 /**
  *
  * @param results.json
- * @returns final analysis results of all files created and deleted in streaming_partial_decompresser
+ * @returns final analysis results of all files created and deleted in zst_decompressor.ts
  */
 async function aggregateResults(filePath: string) {
   const data = JSON.parse(readFileSync(filePath, 'utf-8'));
src/zst_decompressor.ts (13 additions, 5 deletions)
Owner:
Since the file was renamed, it's hard for me to see what specific changes you've made here. Please test this script before merging into main to make sure it's working properly. I'd prefer, above all, that you Loom/screen-record the script running on a smaller dataset so I can verify it works properly. Or I can run it myself - let me know when it's ready to be run (now?)

Collaborator Author:
I'll see if I can run this.

@@ -1,4 +1,5 @@
 import { randomUUID } from 'crypto';
+import * as path from 'path';
 
 // TODO: This should use type checking
 const fs = require('fs');
@@ -31,9 +32,11 @@ async function runAnalysis(filePath: string) {
   // Run the analysis script
   console.log(`Running analysis script on ${filePath}...`);
 
+  const analysisFileBasePath = path.resolve(__dirname, '..', 'src');
+
   const child = spawn('ts-node', [
     // '/Users/bennyrubanov/Coding_Projects/chessanalysis/src/index_with_decompressor.ts',
-    `${__dirname}/../../run_metrics_on_input.ts`,
+    `${analysisFileBasePath}/run_metrics_on_input.ts`,
     filePath,
   ]);

@@ -82,7 +85,10 @@ const decompressAndAnalyze = async (file, start = 0) => {
   const filesProduced = new Set();
 
   // const base_path = `/Users/bennyrubanov/Coding_Projects/chessanalysis/data/${file.replace(
-  const base_path = `${__dirname}/../../data/${file.replace('.zst', '')}`;
+  // base_path used to enumerate where new files should go
+  const base_path = path.resolve(__dirname, '..', 'data', file.replace('.zst', ''));
+  // for use in decompressionStreamFromFile
+  const compressedFilePath = path.resolve(__dirname, '..', 'data');
 
   // Create a new file path
   const newFilePath = `${base_path}_${randomUUID()}`;
Owner:
Does randomUUID create a random file path number? I prefer counting them; it corresponds better with the log here too: console.log(`Creating file #${file_counter} at ${newFilePath}`);

Collaborator Author:
The issue with counting is that under high concurrency the counts will not match up. We could key on a game identifier or add locking if necessary, but for uniqueness I use a UUID.

Owner:
Let's do both? One for the logs, one for avoiding concurrency/overwrites?
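For context, a minimal sketch of what "both" could look like, reusing the file_counter, base_path, and newFilePath names from this diff; the nextFilePath helper is hypothetical and not part of the PR. The counter stays human-readable for the log line, while the UUID suffix keeps concurrent writers from colliding.

```ts
import { randomUUID } from 'crypto';

// Hypothetical helper: a sequential counter for readable logs plus a UUID
// suffix so concurrent writers can never overwrite each other's files.
let file_counter = 0;

function nextFilePath(base_path: string): string {
  file_counter++;
  const newFilePath = `${base_path}_${file_counter}_${randomUUID()}`;
  console.log(`Creating file #${file_counter} at ${newFilePath}`);
  return newFilePath;
}
```

Under heavy concurrency the counter can still appear out of order in the logs, but uniqueness no longer depends on it.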

@@ -108,9 +114,10 @@ const decompressAndAnalyze = async (file, start = 0) => {
 
   // https://www.npmjs.com/package/node-zstandard#decompressionstreamfromfile-inputfile-callback
   zstd.decompressionStreamFromFile(
-    `${__dirname}/../../data/${file}`,
+    `${compressedFilePath}/${file}`,
Collaborator Author (@EllAchE, Mar 20, 2024):
@bennyrubanov FYI, these aren't bugs. They are differences between your workflow, which uses ts-node, and mine, which transpiles to JS first.

Owner:
Understood. Can we write the code to work with both?
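As a sketch of one way to make path resolution agnostic to ts-node vs. transpiled output, the data directory could be located from the project root instead of from __dirname (this assumes a package.json at the repository root; findProjectRoot is an illustrative helper, not code from this PR):

```ts
import * as fs from 'fs';
import * as path from 'path';

// Walk upward from the current module until a package.json is found, so the
// same lookup works whether __dirname is <root>/src (ts-node) or a compiled
// output directory such as <root>/dist/src.
function findProjectRoot(startDir: string): string {
  let dir = startDir;
  while (!fs.existsSync(path.join(dir, 'package.json'))) {
    const parent = path.dirname(dir);
    if (parent === dir) throw new Error('Could not locate project root');
    dir = parent;
  }
  return dir;
}

const dataDir = path.join(findProjectRoot(__dirname), 'data');
```

The trade-off is a little filesystem probing at startup in exchange for not having to keep the ../ depth in sync between the TypeScript sources and the compiled output.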

     (err, result) => {
       if (err) return reject(err);
+      console.log(`Decompressing file located at ${compressedFilePath}/${file}`);
 
       let fileLength = 0;
       let batch_files_total_decompressed_size = 0;
@@ -140,6 +147,7 @@ const decompressAndAnalyze = async (file, start = 0) => {
       console.log(
         `Total number of chunks decompressed so far: ${total_chunk_counter}`
       );
+
       // Increment the file counter
       file_counter++;
@@ -210,7 +218,7 @@ const decompressAndAnalyze = async (file, start = 0) => {
 
 // Function to process all files
 const processFiles = async (files: string[]) => {
-  console.log(`Initiating decompression and analysis of ${files}...`);
+  console.log(`Initiating decompression and analysis of files: ${files}...`);
   console.time('Final Total Compressed File Analysis Execution Time');
   for (const file of files) {
     await decompressAndAnalyze(file);
@@ -232,6 +240,6 @@ module.exports = processFiles;
 // run if main
 if (require.main === module) {
   // List of all the database files you want to analyze (these need to be downloaded and in data folder)
-  const files = ['lichess_db_standard_rated_2013-02.pgn.zst' /*...*/];
+  const files = ['lichess_db_standard_rated_2013-01.pgn.zst' /*...*/];
   processFiles(files);
 }