forked from webrecorder/browsertrix-behaviors
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
KGX747 authored and KGX747 committed Mar 28, 2024
1 parent c0383a2, commit 5b704f7
Showing 6 changed files with 346 additions and 241 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,124 +1,41 @@ | ||
# Browsertrix Behaviors | ||
|
||
<details> | ||
<summary><b>Behavior Testing Results</b></summary> | ||
This is a fork of [browsertrix-behaviors](https://github.com/webrecorder/browsertrix-behaviors), with added functionality for harvesting special features of Luxembourgish websites. In this repository, we have included the following behavior: | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/autoscroll.yaml) | ||
* *Woodee* : dynamic flipbook/brochure page flipping and harvesting. | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/autoplay-youtube.yaml) | ||
For building/injecting the behaviors into Browsertrix Crawler, please refer to the [original README](original_README.md). | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/autoplay-vimeo.yaml) | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/instagram.yaml) | ||
# Compiling | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/twitter.yaml) | ||
You need `yarn` to compile the behaviors into a package. The package is created in the `dist` folder and is called `behaviors.js`. | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/twitter-logged-in.yaml) | ||
Go to the root folder of the project and run: | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/facebook-page.yaml) | ||
yarn run build | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/facebook-photos.yaml) | ||
Now go to the Browsertrix Crawler folder, and copy the file `behaviors.js` into the root folder: | ||
|
||
[](https://github.com/webrecorder/browsertrix-behaviors/actions/workflows/facebook-videos.yaml) | ||
cp ../browsertrix-behaviors/dist/behaviors.js . | ||
|
||
</details> | ||
Now change the user to `webcrawler`. Then recompile the crawler and tag it: | ||
|
||
A set of behaviors injected into the browser to perform certain operations on a page, such as scrolling, fetching additional URLs, or performing | ||
customized actions for social-media sites. | ||
podman build -t webrecorder/browsertrix-crawler:local . | ||
podman tag <image_id> localhost:5000/webrecorder/browsertrix-crawler:local | ||
|
||
Finally, push it to the registry: | ||
|
||
## Usage | ||
podman push localhost:5000/webrecorder/browsertrix-crawler:local | ||
|
||
There is no need to restart podman, as Browsertrix Cloud will automatically spawn the next crawlers from the registry with the `local` tag. | ||
|
||
The behaviors are compiled into a single file, `dist/behaviors.js`, which can be injected into any modern browser to load the behavior system. | ||
No additional dependencies are required, and the behaviors file can be pasted directly into your browser. | ||
|
||
The file can be injected in a number of ways, using tools like puppeteer/playwright, a browser extension content script, a devtools Snippet, or even a regular | ||
`<script>` tag. Injecting the behaviors into the browser is outside the scope of this repo, but here are a few ways you can try the behaviors: | ||
# Testing | ||
|
||
### Copy & Paste Behaviors (for testing) | ||
Copy the contents of the compiled `behaviors.js` file to the browser's dev tools console, then execute: | ||
|
||
To test out the behaviors in your current browser, you can: | ||
self.__bx_behaviors.run(); | ||
|
||
|
||
1. Go to the [dist/behaviors.js](dist/behaviors.js) | ||
2. Copy the file (it is minified so will be on one line). | ||
3. Open a web page, such as one that has a custom behavior, like: [https://twitter.com/webrecorder_io](https://twitter.com/webrecorder_io) | ||
4. Open devtools console, and paste the script | ||
5. Enter `self.__bx_behaviors.run();` | ||
6. You should see the Twitter page automatically scrolling and visiting tweets. | ||
|
||
|
||
### Use Puppeteer | ||
|
||
To integrate behaviors into an automated workflow, here is a short example using puppeteer. | ||
|
||
```javascript | ||
// assumes browsertrix-behaviors is installed as a node module | ||
const fs = require("fs"); | ||
const behaviors = fs.readFileSync("./node_modules/browsertrix-behaviors/dist/behaviors.js", "utf-8"); | ||
|
||
await page.evaluateOnNewDocument(behaviors + ` | ||
self.__bx_behaviors.init({ | ||
autofetch: true, | ||
autoplay: true, | ||
autoscroll: true, | ||
siteSpecific: true, | ||
}); | ||
`); | ||
|
||
// call and await run on top frame and all child iframes | ||
await Promise.allSettled(page.frames().map(frame => frame.evaluate("self.__bx_behaviors.run()"))); | ||
|
||
``` | ||
|
||
|
||
See [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) for a complete working example of injection using puppeteer. | ||
|
||
## Initialization | ||
|
||
Once the behavior script has been injected, run `__bx_behaviors.init(opts)` to select which behaviors should be used. `opts` accepts the following options (boolean unless noted otherwise): | ||
|
||
- `autofetch` - enable background autofetching of img srcsets and stylesheets (when possible) | ||
- `autoplay` - attempt to automatically play any video/audio, or fetch the URLs for any video streams found on the page. | ||
- `autoscroll` - attempt to repeatedly scroll the page to the bottom as far as possible. | ||
- `timeout` - set a timeout (in ms) for all behaviors to finish. | ||
- `siteSpecific` - run a site-specific behavior if available. | ||
- `log` - a function or global string to receive log messages from behaviors | ||
|
||
### Background Behaviors | ||
|
||
The `autoplay` and `autofetch` options enable background behaviors, which run as soon as `init(...)` is called, or as soon as the page is loaded. | ||
Background behaviors do not change the page, but attempt to do additional fetching to ensure more resources are loaded. | ||
Background behaviors can be used with user-directed browsing, and can also be loaded in any iframes on the page. | ||
|
||
|
||
### Active Behaviors | ||
|
||
The `autoscroll` and `siteSpecific` options enable 'active' behaviors, which modify the page and run until they finish or time out. | ||
|
||
If both `siteSpecific` and `autoscroll` are specified, only one behavior is run. If a site-specific behavior exists, it takes precedence over auto-scroll; otherwise, auto-scroll is used. | ||
|
||
|
||
Currently, site-specific behaviors are available for: | ||
|
||
|
||
Additional site-specific behaviors can be added to the [site](./src/site) directory. | ||
|
||
To run the active behavior, call: `await __bx_behaviors.run()` after init. | ||
|
||
Alternatively, calling `await __bx_behaviors.run(opts)` will also call `init(opts)` if init has not been called before. | ||
|
||
The promise returned by `run` resolves when the active behavior finishes or when the `timeout` is reached, whichever comes first. It also ensures that any pending autoplay requests are started for the `autoplay` behavior. | ||
|
||
## Logging | ||
|
||
By default, behaviors will log debug messages to `console.log`. To disable this logging, set `log: false` in the init options. | ||
|
||
This param can also be set to the name of a custom log function, given as a string. For example, to have behavior event messages be passed to `self.my_log`, set `log: "my_log"` in the options. | ||
|
||
Additional logging options may be added soon. | ||
|
||
## Building | ||
|
||
Browsertrix Behaviors uses webpack to build. Run `yarn run build` to build the latest `dist/behaviors.js`. | ||
|
||
Shared utility functions can be added to `utils.js` while site-specific behavior can be added to `lib/site`. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
# Browsertrix Behaviors | ||
|
||
A set of behaviors injected into the browser to perform certain operations on a page, such as scrolling, fetching additional URLs, or performing customized actions for social-media sites. | ||
|
||
## Usage | ||
|
||
The behaviors are compiled into a single file, `dist/behaviors.js`, which can be injected into any modern browser to load the behavior system. | ||
No additional dependencies are required, and the behaviors file can be pasted directly into your browser. | ||
|
||
The file can be injected in a number of ways, using tools like puppeteer/playwright, a browser extension content script, a devtools Snippet, or even a regular | ||
`<script>` tag. Injecting the behaviors into the browser is outside the scope of this repo, but here are a few ways you can try the behaviors: | ||
|
||
### Copy & Paste Behaviors (for testing) | ||
|
||
To test out the behaviors in your current browser, you can: | ||
|
||
1. Go to the [dist/behaviors.js](dist/behaviors.js) | ||
2. Copy the file (it is minified so will be on one line). | ||
3. Open a web page, such as one that has a custom behavior, like: [https://twitter.com/webrecorder_io](https://twitter.com/webrecorder_io) | ||
4. Open devtools console, and paste the script | ||
5. Enter `self.__bx_behaviors.run();` | ||
6. You should see the Twitter page automatically scrolling and visiting tweets. | ||
|
||
|
||
### Use Puppeteer | ||
|
||
To integrate behaviors into an automated workflow, here is a short example using puppeteer. | ||
|
||
```javascript | ||
// assumes browsertrix-behaviors is installed as a node module | ||
const fs = require("fs"); | ||
const behaviors = fs.readFileSync("./node_modules/browsertrix-behaviors/dist/behaviors.js", "utf-8"); | ||
|
||
await page.evaluateOnNewDocument(behaviors + ` | ||
self.__bx_behaviors.init({ | ||
autofetch: true, | ||
autoplay: true, | ||
autoscroll: true, | ||
siteSpecific: true, | ||
}); | ||
`); | ||
|
||
// call and await run on top frame and all child iframes | ||
await Promise.allSettled(page.frames().map(frame => frame.evaluate("self.__bx_behaviors.run()"))); | ||
|
||
``` | ||
|
||
|
||
See [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) for a complete working example of injection using puppeteer. | ||
|
||
## Initialization | ||
|
||
Once the behavior script has been injected, run `__bx_behaviors.init(opts)` to select which behaviors should be used. `opts` accepts the following options (boolean unless noted otherwise): | ||
|
||
- `autofetch` - enable background autofetching of img srcsets and stylesheets (when possible) | ||
- `autoplay` - attempt to automatically play any video/audio, or fetch the URLs for any video streams found on the page. | ||
- `autoscroll` - attempt to repeatedly scroll the page to the bottom as far as possible. | ||
- `timeout` - set a timeout (in ms) for all behaviors to finish. | ||
- `siteSpecific` - run a site-specific behavior if available. | ||
- `log` - a function or global string to receive log messages from behaviors | ||
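
As a rough illustration of how these options fit together, the sketch below builds the script string that the Puppeteer example injects. `buildInitScript` is a hypothetical helper, not part of browsertrix-behaviors, and the option values are placeholders. Because the options are serialized with `JSON.stringify`, `log` has to be passed as a string (or `false`) in this approach; a function value would be silently dropped.

```javascript
// Hypothetical helper (not part of browsertrix-behaviors): appends an init()
// call with the given options to the compiled behaviors source.
function buildInitScript(behaviorsSource, opts) {
  // JSON.stringify drops function values, so pass `log` as a string or false.
  return `${behaviorsSource}\nself.__bx_behaviors.init(${JSON.stringify(opts)});`;
}

const exampleOpts = {
  autofetch: true,
  autoplay: true,
  autoscroll: true,
  siteSpecific: true,
  timeout: 30000, // milliseconds; placeholder value
  log: "my_log", // name of a global function, e.g. self.my_log
};

// "/* behaviors.js */" stands in for the contents of dist/behaviors.js.
const initScript = buildInitScript("/* behaviors.js */", exampleOpts);
console.log(initScript);
```

The resulting string could then be passed to `page.evaluateOnNewDocument(...)` in place of the hand-written template literal.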
|
||
### Background Behaviors | ||
|
||
The `autoplay` and `autofetch` options enable background behaviors, which run as soon as `init(...)` is called, or as soon as the page is loaded. | ||
Background behaviors do not change the page, but attempt to do additional fetching to ensure more resources are loaded. | ||
Background behaviors can be used with user-directed browsing, and can also be loaded in any iframes on the page. | ||
|
||
|
||
### Active Behaviors | ||
|
||
The `autoscroll` and `siteSpecific` options enable 'active' behaviors, which modify the page and run until they finish or time out. | ||
|
||
If both `siteSpecific` and `autoscroll` are specified, only one behavior is run. If a site-specific behavior exists, it takes precedence over auto-scroll; otherwise, auto-scroll is used. | ||
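
The precedence rule can be sketched as a small function. This is only an illustration of the rule as stated here, not the library's actual selection code; `chooseActiveBehavior` and the `hasSiteBehavior` flag are hypothetical names.

```javascript
// Illustrative only: mirrors the documented precedence rule.
// `hasSiteBehavior` is a hypothetical flag meaning "a site-specific
// behavior exists for the current page".
function chooseActiveBehavior(opts, hasSiteBehavior) {
  if (opts.siteSpecific && hasSiteBehavior) {
    return "siteSpecific"; // site-specific behavior wins when one exists
  }
  if (opts.autoscroll) {
    return "autoscroll"; // otherwise fall back to auto-scroll
  }
  return null; // no active behavior enabled
}

console.log(chooseActiveBehavior({ siteSpecific: true, autoscroll: true }, true)); // "siteSpecific"
console.log(chooseActiveBehavior({ siteSpecific: true, autoscroll: true }, false)); // "autoscroll"
```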
|
||
|
||
Currently, site-specific behaviors are available for: | ||
|
||
|
||
Additional site-specific behaviors can be added to the [site](./src/site) directory. | ||
|
||
To run the active behavior, call: `await __bx_behaviors.run()` after init. | ||
|
||
Alternatively, calling `await __bx_behaviors.run(opts)` will also call `init(opts)` if init has not been called before. | ||
|
||
The promise returned by `run` resolves when the active behavior finishes or when the `timeout` is reached, whichever comes first. It also ensures that any pending autoplay requests are started for the `autoplay` behavior. | ||
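
One common way to get "finish or time out" semantics is `Promise.race`. The sketch below shows that pattern under the assumption that the behavior is represented by a promise; `runWithTimeout` is a hypothetical name, and the library's own implementation may differ.

```javascript
// A minimal sketch of "wait for the behavior or the timeout, whichever
// comes first". Not the library's implementation.
function runWithTimeout(behaviorPromise, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const finished = Promise.resolve(behaviorPromise).then(() => "finished");
  // Clear the timer once either side settles so it does not linger.
  return Promise.race([finished, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for an active behavior that takes about 20 ms.
const fakeBehavior = new Promise((resolve) => setTimeout(resolve, 20));
runWithTimeout(fakeBehavior, 1000).then((outcome) => console.log(outcome));
```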
|
||
## Logging | ||
|
||
By default, behaviors will log debug messages to `console.log`. To disable this logging, set `log: false` in the init options. | ||
|
||
This param can also be set to the name of a custom log function, given as a string. For example, to have behavior event messages be passed to `self.my_log`, set `log: "my_log"` in the options. | ||
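
To make the accepted `log` values concrete, here is a hedged sketch of how such an option might be resolved to a function. `resolveLogger` is a hypothetical name, and the fallback to `console.log` for an unknown string is an assumption of this sketch rather than documented behavior.

```javascript
// Illustrative resolution of the `log` option; not the library's code.
function resolveLogger(log, scope = globalThis) {
  if (log === false) {
    return () => {}; // logging disabled
  }
  if (typeof log === "function") {
    return log; // use the function directly
  }
  if (typeof log === "string" && typeof scope[log] === "function") {
    // Look up a global function by name, e.g. "my_log" -> self.my_log.
    return (...args) => scope[log](...args);
  }
  return console.log; // default (assumed fallback for unknown names)
}

// Usage: `demoScope` stands in for the page's global object.
const demoScope = { my_log: (msg) => console.log("[behavior]", msg) };
resolveLogger("my_log", demoScope)("scrolling started");
```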
|
||
Additional logging options may be added soon. | ||
|
||
## Building | ||
|
||
Browsertrix Behaviors uses webpack to build. Run `yarn run build` to build the latest `dist/behaviors.js`. | ||
|
||
Shared utility functions can be added to `utils.js` while site-specific behavior can be added to `lib/site`. |