Set up web analytics with GoAccess #512

Open
acka47 opened this issue Sep 28, 2023 · 15 comments

Contributor

acka47 commented Sep 28, 2023

Current status is at http://gaia.hbz-nrw.de/stats/

Contributor Author

acka47 commented Sep 28, 2023

To Dos:

  • Filter out all the URLs that are sideloaded when opening a web page, e.g. for NWBib: [image]
  • Provide yearly overview in addition to monthly view.

Phu2 self-assigned this Sep 28, 2023
Contributor

Phu2 commented Sep 28, 2023

The monthly access logs (e.g. access_log-20230901) were only partially included in the generated reports (01/Aug/2023 to 18/Aug/2023) due to our preprocessing.
Example: for some reason, grep ' www.lobid.org ' /tmp/access_log-20230901 didn't parse the whole log file and printed "grep: /tmp/access_log-20230901: Übereinstimmungen in Binärdatei" (English: "binary file matches"). Solution: use grep --text or fgrep --text, as we are searching for a fixed string.
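For illustration, a minimal sketch of the --text behaviour on a log that contains binary data. The sample log lines and the temporary file are made up; only grep's -F (fixed string) and --text options are relied on:

```shell
#!/bin/sh
# Hypothetical sample log: two matching lines with a line of NUL bytes in between,
# similar to the binary debris mentioned in this thread.
log=$(mktemp)
printf '%s\n' '1.2.3.4 www.lobid.org "GET /gnd HTTP/1.1" 200' > "$log"
printf '\000\000garbage\n' >> "$log"
printf '%s\n' '5.6.7.8 www.lobid.org "GET /resources HTTP/1.1" 200' >> "$log"

# -F searches for a fixed string; --text makes grep treat the file as text,
# so matching lines after the binary data are still printed.
matches=$(grep --text -F ' www.lobid.org ' "$log")
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "$count matching lines"

rm -f "$log"
```

Without --text, GNU grep may stop printing matches once it detects binary data and report that the binary file matches instead.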

Member

dr0i commented Sep 28, 2023

The binary data in the logs results from the server crash. --text is good!

Contributor

Phu2 commented Sep 28, 2023

We could do bzfgrep --text on the compressed logs directly. It takes 7m3,410s compared to 2m51,289s for access_log-20230901(.bz2). What do you think?

Member

dr0i commented Sep 28, 2023

bzfgrep --text 👍

Contributor

Phu2 commented Oct 9, 2023

I'm stuck at filtering out static files like *.png or *.css, e.g.

grep --text -E -v "(robots.txt|.ico|.woff2|.ttf|.webp|.gif|.svg|.jpg|.png|.js|.css)" access_log_lobid-blog-20230901

works fine, but these files are still listed in the report generated by GoAccess, see screenshot:

[screenshot]

and I don't know why. @dr0i Can you help?
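A side note on the pattern itself, independent of the problem described here: in an extended regular expression an unescaped dot matches any character, so an alternative like .js also matches "=js" in "format=json". A minimal sketch of a stricter variant, assuming request lines in the usual "GET /path HTTP/1.1" form. The sample log lines and the variable names loose and strict are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample log with one page, two static files and one JSON request.
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [01/Sep/2023:00:00:01 +0200] "GET /blog/post HTTP/1.1" 200 123
1.2.3.4 - - [01/Sep/2023:00:00:02 +0200] "GET /css/style.css HTTP/1.1" 200 456
1.2.3.4 - - [01/Sep/2023:00:00:03 +0200] "GET /search?format=json HTTP/1.1" 200 789
1.2.3.4 - - [01/Sep/2023:00:00:04 +0200] "GET /images/logo.png HTTP/1.1" 200 321
EOF

# Pattern as used in this thread: every dot matches any character.
loose='(robots.txt|.ico|.woff2|.ttf|.webp|.gif|.svg|.jpg|.png|.js|.css)'
# Stricter sketch: literal dots, and the extension must end the path
# (followed by a space or the start of a query string).
strict='(robots\.txt|\.(ico|woff2|ttf|webp|gif|svg|jpg|png|js|css))[ ?]'

loose_kept=$(grep --text -E -v "$loose" "$log" | wc -l | tr -d ' ')
strict_kept=$(grep --text -E -v "$strict" "$log" | wc -l | tr -d ' ')
echo "loose pattern keeps $loose_kept lines, strict pattern keeps $strict_kept lines"

rm -f "$log"
```

With this sample, the loose pattern also drops the format=json request, while the strict one keeps it.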

Member

dr0i commented Oct 9, 2023

I couldn't find a flaw in your code. Checking http://gaia.hbz-nrw.de/stats/lobid/access_log_lobid-blog-2023-09-01.html, I cannot see any png, for example. Did you check the proper output?

Contributor

Phu2 commented Oct 9, 2023

Due to some obscure replacement of quotes in bash, grep commands like the one above don't work as expected. As a workaround, I'm calling grep directly in bash (neither via a variable nor via an array). Thx, @dr0i.
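One common way to run into this kind of quoting problem, sketched below, is storing the whole command in a string variable: the quotes inside the string are not parsed again on expansion, so grep receives them as part of the pattern. This is only a guess at the cause, since the actual report script is not shown here. The sample log, the shortened pattern and the function name filter_static are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample log: one page and two static files.
log=$(mktemp)
cat > "$log" <<'EOF'
"GET /blog/post HTTP/1.1" 200
"GET /logo.png HTTP/1.1" 200
"GET /style.css HTTP/1.1" 200
EOF

# Problem case: the double quotes stay literal after expansion, so the
# pattern only matches text that is itself wrapped in quote characters.
filter='grep --text -E -v "(.png|.css)"'
via_variable=$($filter "$log" | wc -l | tr -d ' ')

# Alternative: wrap the call in a function, where quoting is parsed normally.
filter_static() {
    grep --text -E -v '(.png|.css)' "$@"
}
via_function=$(filter_static "$log" | wc -l | tr -d ' ')

echo "via variable: $via_variable lines kept, via function: $via_function lines kept"
rm -f "$log"
```

In this sketch the string-variable call filters nothing, which would look like static files still showing up in the report.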

Contributor

Phu2 commented Oct 9, 2023

All monthly and yearly reports for 2023 are being generated anew. It will take approx. 10 hours.

Contributor

Phu2 commented Oct 10, 2023

Log files from 20230101 should be excluded from the yearly overview for 2023. Include files from 20240101 instead, as they contain entries from December 2023.
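A minimal sketch of that selection rule, assuming logs are rotated on the first of each month and named access_log-YYYYMM01, so that each file holds the previous month's entries. The function name yearly_logs is made up for illustration:

```shell
#!/bin/sh
# Print the rotated log files that make up one calendar year:
# files dated YYYY0201 .. YYYY1201, plus (YYYY+1)0101 for December.
yearly_logs() {
    year=$1
    for month in 02 03 04 05 06 07 08 09 10 11 12; do
        echo "access_log-${year}${month}01"
    done
    echo "access_log-$((year + 1))0101"
}

yearly_logs 2023
```

For 2023 this lists access_log-20230201 through access_log-20240101 and leaves out access_log-20230101.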

Contributor

Phu2 commented Oct 10, 2023

Yearly overviews are being generated anew.

Contributor

Phu2 commented Oct 11, 2023

@acka47 Please review again.

Contributor Author

acka47 commented Dec 4, 2023

This looks good to me now. Thanks! Can we make the stats available openly on the web so that NWBib editors can view them? I guess there shouldn't be any problems regarding privacy.

Furthermore, we will also need this for RPB (https://rpb.lobid.org/), RPPD (https://rppd.lobid.org/) and BiblioVino (https://wein.lobid.org/). LBZ partners just asked, see RPB-42. Should I open a separate issue for this or do we add this in the context of this issue?

Contributor Author

acka47 commented Jan 8, 2024

I just talked about this issue with @Phu2. Here are the next steps:

Contributor

Phu2 commented Jan 12, 2024

Things to consider:

  • Monthly reports are retrospective, e.g. for January stats see access_log_nwbib-de-2023-02-01.html
  • From 2023-11 onwards, reports are based on dedicated access logs for each service. Before that, we had one single large access log from which we filtered out the requests for each service.
