Add Parallelism (again) #9

Open
wants to merge 5 commits into
base: master

Conversation

seonghobae

Sorry for my misunderstanding. This PR fixes the code so that it works properly, following the review comments I received in #8.

  • Removed the lines related to installed.packages().
  • Added requireNamespace('future.apply').
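
For readers unfamiliar with the pattern, here is a minimal sketch of how requireNamespace() can guard an optional dependency instead of scanning installed.packages(). It is not necessarily how this PR implements it; the helper name parallel_lapply is hypothetical and used only for illustration.

```r
# Hypothetical helper (not taken from the PR): use future.apply when it is
# installed, otherwise fall back to base lapply().
parallel_lapply <- function(X, FUN, ...) {
  if (requireNamespace("future.apply", quietly = TRUE)) {
    # Runs according to whatever future::plan() the caller has set;
    # with no plan set, this evaluates sequentially.
    future.apply::future_lapply(X, FUN, ...)
  } else {
    lapply(X, FUN, ...)
  }
}

squares <- parallel_lapply(1:4, function(i) i^2)
```

Either branch returns a list of the same shape, so callers do not need to know which backend was used.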

@seonghobae seonghobae requested a review from jwijffels January 31, 2020 16:36
@jwijffels
Collaborator

Thanks. Looks fine

@seonghobae
Author

Thanks. I'm testing these commits with my real research project; it seems faster when I add multiple cluster nodes over SSH connections, using the cluster activation procedure below.

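  # Two-level plan: the outer level spreads work over remote machines via SSH,
  # the inner level uses local cores on each machine.
  future::plan(list(
    future::tweak(
      'cluster',
      workers = paste0('mpiuser@192.168.1.', 179:180),
      homogeneous = FALSE
    ),
    # Use about half of the physical cores, but at least one.
    future::tweak('multiprocess', workers = max(c(
      1, round(parallel::detectCores(logical = FALSE) * .5)
    )))
  ))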

@jwijffels
Collaborator

Can you also compare speed wrt pull request #7

@seonghobae
Author

seonghobae commented Jan 31, 2020

> Can you also compare speed wrt pull request #7

In theory, request #7 should bring some speed improvements by using the data.table library with keys, and it has a nice interface. However, I cannot find where to set the number of parallel cores in request #7. Request #7 uses pbapply to display progress information; however, to my knowledge, the pbapply API does not support multi-machine environments (only single-machine parallelism). That means future.apply can support supercomputing workloads through the 'future' API, but pbapply cannot.

I need a multi-machine parallel environment to get real speed improvements for large-scale scientific language research with heterogeneous computing. I have ten machines, including my VPS and workstations; with #9 they gave an eightfold speed improvement, even though I'm only using 1 Gbps links. Without parallelized apply functions that use multiple cores or multiple machines, such as those in the future.apply library, I do not believe any approach can improve the calculation speed. data.table speeds up data processing by acting as a temporary in-memory database; it does not speed up the calculation itself.

Requests #8 and #9 introduce nested parallel structures by replacing all existing *apply functions, not only in textrank_sentence but in all of the textrank:: functions. pbapply does support cl objects from parallel::makeCluster(); however, it is hard to support nested parallelism that way. With #8 and #9, therefore, the speedup of the calculations depends on the number of threads and machines. The main issue is not the data.table library; the core point is parallelized apply functions, which solve the issue of #7 across machines.
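
To make the idea of nested parallelism concrete, here is a minimal, self-contained sketch. It assumes the future and future.apply packages are installed and is not the code from #8 or #9. The toy data and the count_words helper are hypothetical, and the outer level uses local multisession workers instead of the SSH cluster and 'multiprocess' backend shown earlier (multiprocess has been deprecated in later releases of future), so it can run on a single machine.

```r
# Assumes 'future' and 'future.apply' are installed.
# Two-level topology: the outer level runs in 2 local background R sessions,
# the inner level runs sequentially inside each of them.
old_plan <- future::plan(list(
  future::tweak(future::multisession, workers = 2),
  future::sequential
))

# Hypothetical toy data: two "documents", each a vector of sentences.
docs <- list(
  a = c("one two", "three four five"),
  b = c("six", "seven eight")
)

# Inner level: count the words in each sentence of one document.
count_words <- function(sentences) {
  future.apply::future_vapply(
    sentences,
    function(s) length(strsplit(s, " ", fixed = TRUE)[[1]]),
    integer(1),
    USE.NAMES = FALSE
  )
}

# Outer level: one future-based task per document.
word_counts <- future.apply::future_lapply(docs, count_words)

# Restore whatever plan was active before.
future::plan(old_plan)
```

Swapping the first element of the plan list for a cluster backend with remote workers would, under the same assumptions, spread the outer loop across machines without changing count_words.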

2 participants