Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse each file only once #200

Merged
merged 7 commits into from
Jan 10, 2025

Conversation

mkljakubowski
Copy link
Contributor

@mkljakubowski mkljakubowski commented Jan 4, 2025

With this change I was able to reduce compilation time by 85% vs the original (about 40% vs round 1).

Comment on lines -103 to -106
val importedSchemaOrProtocols = importedFiles.flatMap(file => {
val res = infile.getName.split("\\.").last match {
case "avro" =>
val gdr = new GenericDatumReader[GenericRecord]
val dfr = new DataFileReader(infile, gdr)
val schemas = unUnion(dfr.getSchema)
schemas.map(Left(_))
case "avsc" =>
val schemas = tryParse(infile, parser)
schemas.map(Left(_))
case "avpr" =>
val protocol = Protocol.parse(infile)
List(Right(protocol))
case "avdl" =>
val idlParser = new Idl(infile, classLoader)
val protocol = idlParser.CompilationUnit()
/**
* IDLs may refer to types imported from another file. When converted
* to protocols, the imported types that share the IDL's namespace
* cannot be distinguished from types defined within the IDL, yet
* should not be generated as subtypes of the IDL's ADT and should
* instead be generated in its own namespace. So, strip the protocol
* of all imported types and generate them separately.
*/
val importedFiles = IdlImportParser.getImportedFiles(infile, classLoader)
val importedSchemaOrProtocols = importedFiles.flatMap { file =>
if (!processedFiles.contains(file.getCanonicalPath)) {
processedFiles += file.getCanonicalPath
val importParser = new Parser() // else attempts to redefine schemas
getSchemaOrProtocols(file, format, classStore, classLoader, importParser)
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the code potentially revisits the same file multiple times

@julianpeeters
Copy link
Owner

Impressive speed-up, thanks

I'm inclined to release this as 2.11.0. Let me know if you'd prefer to add more to this PR, otherwise I'll release this sometime next week

@mkljakubowski
Copy link
Contributor Author

Just found another one of this uncached recursive calls, but this is all I have for now. It cuts down 92% time vs original in my case.
I have a few more ideas, but they might reorder imports, so tests need to be updated, which is rather time consuming. I'll try to look into it as some point. The current performance is already good enough for my use case.

@julianpeeters julianpeeters merged commit a3c6850 into julianpeeters:main Jan 10, 2025
6 checks passed
@julianpeeters
Copy link
Owner

thanks for round 2

avrohugger 2.11.0 and sbt-avrohugger 2.11.0 are up on sonatype and synching to maven central now

@mkljakubowski
Copy link
Contributor Author

part of #199

@mkljakubowski mkljakubowski deleted the speedup-round2 branch January 15, 2025 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants