This repository has been archived by the owner on Apr 9, 2019. It is now read-only.

- Added a possibility to use the service for scraping (X-Render-Url header)

- Added a parameter "wait_for_process_time" that makes the service more friendly for load balancers
Free Man committed Apr 14, 2018
1 parent eb1d9f4 commit d4e4806
Showing 5 changed files with 78 additions and 27 deletions.
32 changes: 31 additions & 1 deletion README.md
@@ -30,11 +30,26 @@ Useful for SEO and indexing by social media. Works on LOW-END-BOXES, does not co
Create your `config.php` file in the root directory to override default settings.
The `config.dist.php` is a default configuration file and can be copied in place of `config.php`

#### Reference

- **skipped_headers**: List of headers that are not forwarded (PhantomJS only)
- **allowed_domains**: Accept requests only for these domains (all browsers)
- **delay**: Number of seconds to give the browser before reading the result from it (PhantomJS only)
- **timeout**: Maximum amount of time the browser is given to render (PhantomJS only)
- **with_images**: Whether to load images while rendering; usually not necessary (all browsers)
- **debug**: Print debugging information instead of the result (PhantomJS only)
- **renderer**: Selects the browser to use: phantomjs or chromium
- **chromium_binary**: The command name for Chromium, e.g. chrome, chromium, google-chrome-beta, or a path to the binary
- **window_size**: Browser window size (Chromium only)
- **open_process_limit**: Maximum number of browser processes (workers), so the server is not overloaded (Chromium only)
- **wait_for_process_time**: Number of seconds to wait for a free process once the maximum number of open browsers (defined in open_process_limit) is reached.
  If no process becomes available within this time, a 503 is returned.
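
As an illustration of the options above, here is a minimal sketch of a `config.php` for the Chromium renderer. It assumes the file returns an array in the same shape as `config.dist.php`. The values shown are the defaults from `config.dist.php` and should be adjusted to the deployment.

```php
<?php
// config.php (sketch): assumes the same array shape as config.dist.php.

return [
    'allowed_domains'       => ['localhost'],       // hosts the service may render
    'renderer'              => 'chromium',          // chromium|phantomjs
    'chromium_binary'       => 'chromium-browser',  // command name or path
    'window_size'           => '1920x1080',
    'with_images'           => false,
    'open_process_limit'    => 3,                   // max. concurrent browser processes
    'wait_for_process_time' => 4,                   // seconds to wait before answering 503
];
```

A shorter `wait_for_process_time` makes a saturated instance answer 503 sooner, which may let a load balancer retry another instance earlier. Whether that is desirable depends on the balancer's retry policy.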

## Installation

`make deploy`

## Browser differencies
## Browser differences

| Feature | Chromium | PhantomJS |
| ------------- | ------------- | ----- |
@@ -49,6 +64,21 @@ The `config.dist.php` is a default configuration file and can be copied in place
Usage is simple: redirect any request to this service so that it goes through index.php.
Use the webserver to redirect requests properly and validate which domains are allowed.

#### Render a different page than the request

You can render any page, e.g. facebook.com, by providing its URL in the `X-Render-Url` header.

Example request:

```
GET /
X-Render-Url: https://www.facebook.com/events/209461189653825/
```

This makes it possible to use the prerender service for scraping arbitrary pages in a microservice architecture,
with scaled services behind a load balancer.
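
A client call with this header could look roughly like the sketch below, which uses PHP's built-in HTTP stream wrapper. The function name `buildRenderContextOptions` and the service address in the usage comment are assumptions made for illustration; they are not part of this project.

```php
<?php
declare(strict_types=1);

/**
 * Builds stream-context options for a GET request that asks the prerender
 * service to render $targetUrl instead of the requested path.
 * (Hypothetical helper, not part of the service.)
 */
function buildRenderContextOptions(string $targetUrl, float $timeoutSeconds = 30.0): array
{
    if (filter_var($targetUrl, FILTER_VALIDATE_URL) === false) {
        throw new InvalidArgumentException('Not a valid URL: ' . $targetUrl);
    }

    return [
        'http' => [
            'method'        => 'GET',
            'header'        => 'X-Render-Url: ' . $targetUrl . "\r\n",
            'timeout'       => $timeoutSeconds,
            // Return the body even for error statuses such as 503.
            'ignore_errors' => true,
        ],
    ];
}

// Usage (the service address is an assumption):
// $options = buildRenderContextOptions('https://www.facebook.com/events/209461189653825/');
// $html    = file_get_contents('http://prerender.local/', false, stream_context_create($options));
```

With `ignore_errors` enabled, the caller has to inspect the response status itself (for example via `$http_response_header`) to tell a rendered page from a 503.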

## Cache regeneration

The service keeps the history of successful requests from robots, so later there is a possibility to click
19 changes: 10 additions & 9 deletions config.dist.php
@@ -9,13 +9,14 @@
'x-frontend-prerenderer',
'user-agent'
],
'allowed_domains' => ['localhost'],
'delay' => 5, // PhantomJS
'timeout' => 0, // PhantomJS
'with_images' => false, // Chromium, PhantomJS
'debug' => false, // PhantomJS
'renderer' => 'chromium', // decides which browser to use (values: chromium|phantomjs)
'chromium_binary' => 'chromium-browser', // Chromium (examples: chromium|chrome|chromium-browser)
'window_size' => '1920x1080', // Chromium
'open_process_limit' => 3 // Chromium
'allowed_domains' => ['localhost'],
'delay' => 5, // PhantomJS
'timeout' => 0, // PhantomJS
'with_images' => false, // Chromium, PhantomJS
'debug' => false, // PhantomJS
'renderer' => 'chromium', // decides which browser to use (values: chromium|phantomjs)
'chromium_binary' => 'chromium-browser', // Chromium (examples: chromium|chrome|chromium-browser)
'window_size' => '1920x1080', // Chromium
'open_process_limit' => 3, // Chromium
'wait_for_process_time' => 4 // Chromium
];
23 changes: 19 additions & 4 deletions index.php
@@ -29,11 +29,11 @@

require __DIR__ . '/vendor/autoload.php';

function emitResponse (Response $response, Request $request, VisitedUrlsManager $manager)
function emitResponse (Response $response, Request $request, VisitedUrlsManager $manager = null)
{
$headers = $response->headers->all();

if ($response->getStatusCode() >= 200 && $response->getStatusCode() < 400) {
if ($manager !== null && $response->getStatusCode() >= 200 && $response->getStatusCode() < 400) {
$manager->addUrl($request->getRequestUri());
}

@@ -76,7 +76,22 @@ function emitResponse (Response $response, Request $request, VisitedUrlsManager
$config = $container->get(ConfigurationRepository::class);

// handle the request
$request = Request::createFromGlobals();
$request = Request::createFromGlobals();
$customUrl = $request->headers->get('X-Render-Url');
$isForwardedRequest = false;

if ($customUrl && filter_var($customUrl, FILTER_VALIDATE_URL)) {
$request = Request::create(
$customUrl,
'GET',
[],
$request->cookies->all(),
$request->files->all(),
$request->server->all()
);

$isForwardedRequest = true;
}

// prevalidation
if (!empty($config->get('allowed_domains')) && !in_array($request->getHttpHost(), $config->get('allowed_domains'), true)) {
@@ -85,4 +100,4 @@ function emitResponse (Response $response, Request $request, VisitedUrlsManager
}

$response = $controller->renderAction($request);
emitResponse($response, $request, $manager);
emitResponse($response, $request, $isForwardedRequest === false ? $manager : null);
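
The request handling above can be read as a small decision rule: a valid `X-Render-Url` replaces the requested URL, and the resulting host is then checked against `allowed_domains`. The standalone sketch below restates that rule without the Symfony `Request` class. The function name `resolveRenderTarget` is hypothetical, and comparing only the host (without a port) is a simplification of `getHttpHost()`.

```php
<?php
declare(strict_types=1);

/**
 * Returns the URL to render: the header value when it is a valid URL,
 * otherwise the originally requested URL. Returns null when the allow-list
 * is non-empty and does not contain the target host.
 * (Hypothetical helper; a simplified restatement of the logic in index.php.)
 */
function resolveRenderTarget(?string $headerUrl, string $requestedUrl, array $allowedDomains): ?string
{
    $useHeader = $headerUrl !== null
        && $headerUrl !== ''
        && filter_var($headerUrl, FILTER_VALIDATE_URL) !== false;

    $target = $useHeader ? $headerUrl : $requestedUrl;

    // parse_url() yields null or false when no host can be extracted,
    // which never matches an allow-list entry under strict comparison.
    $host = parse_url($target, PHP_URL_HOST);

    if ($allowedDomains !== [] && !in_array($host, $allowedDomains, true)) {
        return null;
    }

    return $target;
}
```

Under this reading, a host such as `www.facebook.com` would have to be listed in `allowed_domains` (or the list left empty) before it can be rendered through the header.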
28 changes: 16 additions & 12 deletions src/Controller/ChromiumRenderController.php
@@ -27,16 +27,23 @@ class ChromiumRenderController implements RenderInterface
*/
private $openProcessLimit;

/**
* @var int $waitForProcessTime
*/
private $waitForProcessTime;

public function __construct(
string $chromeBinary = 'chromium',
bool $withImages = false,
string $windowSize = '1920x1080',
int $openProcessLimit = 3)
int $openProcessLimit = 3,
int $waitForProcessTime = 4)
{
$this->chromeBinary = $chromeBinary;
$this->withImages = $withImages;
$this->windowSize = $windowSize;
$this->openProcessLimit = $openProcessLimit;
$this->chromeBinary = $chromeBinary;
$this->withImages = $withImages;
$this->windowSize = $windowSize;
$this->openProcessLimit = $openProcessLimit;
$this->waitForProcessTime = $waitForProcessTime;
}

/**
@@ -58,7 +65,7 @@ public function renderAction(Request $request): Response
' --dump-dom --window-size=' . $this->windowSize . ' "' . $url . '"';

if (!$this->canSpawnNewProcess()) {
return new Response('Error: Too many requests', Response::HTTP_TOO_MANY_REQUESTS);
return new Response('Error: Too many requests. Open: ' . $this->getOpenedProcessesCount(), Response::HTTP_SERVICE_UNAVAILABLE);
}

return new Response($this->executeCommand($command));
@@ -90,15 +97,12 @@ private function buildProxyArgument(Request $request): string

private function canSpawnNewProcess(): bool
{
return true;
if ($this->getOpenedProcessesCount() >= $this->openProcessLimit) {
sleep(4);
}

if ($this->getOpenedProcessesCount() >= $this->openProcessLimit) {
return false;
sleep($this->waitForProcessTime);
}

return true;
return $this->getOpenedProcessesCount() < $this->openProcessLimit;
}

private function getOpenedProcessesCount(): int
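
The intended wait-then-recheck behaviour of `canSpawnNewProcess()` can be sketched in isolation as follows. This is a standalone illustration, not the controller code. The process counter and the sleep function are passed in as callables (an assumption made here so the logic can be exercised without real browser processes).

```php
<?php
declare(strict_types=1);

/**
 * Returns true when a new browser process may be started.
 * If the limit is reached, waits once for $waitSeconds and checks again.
 * (Standalone sketch; $countOpen and $sleep are injected for illustration.)
 */
function canSpawnNewProcess(callable $countOpen, int $limit, int $waitSeconds, ?callable $sleep = null): bool
{
    $sleep = $sleep ?? 'sleep';

    if ($countOpen() >= $limit) {
        // At capacity: give running browsers one chance to finish.
        $sleep($waitSeconds);
    }

    // Re-check after the (possible) wait; false means the caller answers 503.
    return $countOpen() < $limit;
}
```

A single wait keeps the worst-case added latency bounded by `wait_for_process_time`. In the diff above, `renderAction()` answers with `Response::HTTP_SERVICE_UNAVAILABLE` when this check fails.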
3 changes: 2 additions & 1 deletion src/DependencyInjection/Services.php
@@ -82,7 +82,8 @@
$config->get('chromium_binary', 'chromium'),
$config->get('with_images', false),
$config->get('window_size', '1920x1080'),
$config->get('open_process_limit', 3)
$config->get('open_process_limit', 3),
$config->get('wait_for_process_time', 4)
);
}
]);
