Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect URLs for non-GitHub repositories #48

Open
sivanahamer opened this issue Feb 19, 2022 · 17 comments
Open

Incorrect URLs for non-GitHub repositories #48

sivanahamer opened this issue Feb 19, 2022 · 17 comments

Comments

@sivanahamer
Copy link

Hello!

There is an issue with generating the URL for some non-GitHub repositories.

For example, the project with URI gitlab.freedesktop.org_libinput_libinput. The following code...

print(oscar.Project("gitlab.freedesktop.org_libinput_libinput").url.decode())

Produces the following output...

https://github.com/gitlab.freedesktop.org/libinput_libinput

However, the project URL is https://gitlab.freedesktop.org/libinput/libinput.

This error can happen to any project that is not hosted in the list of projects in URL_PREFIXES. Updating the list of projects with those within the list seems unmanageable, as the are many non-GitHub repositories in WoC.

@sivanahamer
Copy link
Author

I do have a fix for the issue, though I don't have permissions to add the corresponding commit with the pull request in this repository.

@user2589
Copy link
Member

You can post it here. I'll issue an update this weekend.

Thank you

@sivanahamer
Copy link
Author

sivanahamer commented Feb 19, 2022

In oscar.pyx removed the URLs with the same prefix:

URL_PREFIXES = {
    b'drupal.com': b'git.drupal.org',
    b'kde.org': b'anongit.kde.org',
    b'sourceforge.net': b'git.code.sf.net/p'
}

And then, added the case in the URL property within the Project:

class Project(_Base):
    ...
    @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if prefix == b'sourceforge.net':
            platform = URL_PREFIXES[prefix]
        elif prefix in URL_PREFIXES and b'_' in body:
            platform = URL_PREFIXES[prefix]
            body = body.replace(b'_', b'/', 1)
        elif b'_' in body:
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))

P.D: I'm not sure if all URLs will have only one _ to replace.

@user2589
Copy link
Member

This will brake github urls, as they always include an underscore separating the username from the project name. I'll think of some other solution, though.

@sivanahamer
Copy link
Author

Maybe a solution could be to check if the prefix has “.com”, “.net”, or other common internet domains? Like a url checker.

@user2589
Copy link
Member

This is exactly what I am doing now, but there are a few border cases to deal with

@audrism
Copy link
Contributor

audrism commented Feb 20, 2022

Here is how git url is transformed into project id (bb, gl, and dr are just hostnames for ssh as indicated):

s|^bb:([^/]+)/|bitbucket.org_$1_|;
s|^gl:([^/]+)/|gitlab.com_$1_|;
s|^dr:([^/]+)/|drupal.com_$1_|;
s|^https://([^/]*)/([^/]*)/|$1_$2_|;
s|^https://([^/]*)/|$1_|;

In words: two first / are replaced by _, if there is only one, it is also replaced by _
Simple rules to determine non-github origin:

  • contains . or /
  • contains one of the prefixes listed below.

some examples:

https://salsa.debian.org/science-team/plplot
https://gitlab.inria.fr/allgo/notebooks/speads
https://gitlab.freedesktop.org/aphogat/raytracing
https://pagure.io/alien
https://gitlab.inria.fr/guix-hpc/website
https://gitlab.inria.fr/grid5000/kwollect-dashboard
https://git.eclipse.org/r/henshin/org.eclipse.emft.henshin.git/
https://android.googlesource.com/platform/packages/apps/Test/connectivity

A list of prefixes update in Version T, U, S, Q:

android.googlesource.com
anongit.kde.org
blitiri.com.ar
code.ill.fr
code.qt.io
fedorapeople.org
forgemia.inra.fr
framagit.org
gcc.gnu.org
git.alpinelinux.org
git.bioconductor.org
git.code.sf.net
git.eclipse.org
git.kernel.org
git.openembedded.org
git.pleroma.social
git.postgresql.org
git.savannah.gnu.org
git.savannah.nongnu.org
git.torproject.org
git.unicaen.fr
git.unistra.fr
git.xfce.org
git.yoctoproject.org
git.zx2c4.com
gitbox.apache.org
gite.lirmm.fr
gitlab.adullact.net
gitlab.cerema.fr
gitlab.common-lisp.net
gitlab.fing.edu.uy
gitlab.freedesktop.org
gitlab.gnome.org
gitlab.huma-num.fr
gitlab.inria.fr
gitlab.irstea.fr
gitlab.ow2.org
invent.kde.org
notabug.org
pagure.io
repo.or.cz
salsa.debian.org

@sivanahamer
Copy link
Author

sivanahamer commented Feb 22, 2022

I was trying to create a url checker. A possibility could be to use a regex such as \.[A-Za-z]+$. Also, there is a library that parses public domains, if needed be.

In code, with just the regex, it would be...

class Project(_Base):
    ...
   @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if prefix == b'sourceforge.net':
            platform = URL_PREFIXES[prefix]
        elif prefix in URL_PREFIXES and b'_' in body:
            platform = URL_PREFIXES[prefix]
            body = body.replace(b'_', b'/', 1)
        elif re.search("\.[A-Za-z]+$", prefix.decode()):
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))

I'm not sure if the other cases for sourceforge and the repositories in URL_PREFIXES are needed. If so, URL_PREFIXES could be deleted and the rest of the code could be left as...

class Project(_Base):
    ...
   @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if re.search("\.[A-Za-z]+$", prefix.decode()):
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))

@audrism
Copy link
Contributor

audrism commented Feb 26, 2022

Yes, prefix array is superfluous and, furthermore, needs to be changed with each new forge added.
May be worth checking what characters are allowed in a top-level domain as the proposed code assumes no numeric characters.

@sivanahamer
Copy link
Author

I guess from the list of TLD from here and here it seems that the TLD have no numbers, however there are some that have other Unicode characters.

@sivanahamer
Copy link
Author

Is there any way I can further help with this issue?

@audrism
Copy link
Contributor

audrism commented Mar 4, 2022

Many thanks: would you mind submitting a fix via PR to get the right author for the fix?
(for now, lets not worry about unicode in top-level domain names). I can apply your patch as well if you prefer.
The commit comment should start with FIX issue #48 - url's for non-github repositoried

@sivanahamer
Copy link
Author

I believe I have no rights to commit in the repository.

@user2589
Copy link
Member

user2589 commented Mar 4, 2022

@sivanahamer This issue is not that hard to fix, but for the time it is blocked by another change. Apparently there was a fairly recent change in relations, they started to use deduplicated project URL (capital P vs lowercase p). I need to update several more bits of code to make it work, and was holding only because of that.

Unfortunately, I only have a few hours per week to this project - but I'll try to finish it this weekend.

@audrism
Copy link
Contributor

audrism commented Mar 4, 2022

@user2589 There is also a and A for authors/aliased authors. Just to be clear, the lowercase (not aliased) versions are not going away, however.

I am not sure how you plan to handle it, but it may be simplest to have a separate class for project and author for aliased versions (as a completely different object) as combining both in a single class may add complications.

@gaokai320
Copy link
Contributor

I encountered a case where the repository's owner name contains an underscore. For example: https://gitlab.com/rki_bioinformatics/DeePaC is represented as gitlab.com_rki_bioinformatics_DeePaC.
Besides, GitHub used to allow username contain an underscore
I think we may consider a more suitable separator between owner name and repository name to allow invertible transformation between repository url and woc repository representation. Or we may maintain another database (or add an extra url field to exist P.metadata collection) to map repository url to woc's repository representation.

@audrism
Copy link
Contributor

audrism commented Mar 31, 2022

Great point.

a) Indeed, the following instances of user/org names on github contain underscore, these need to be added as exceptions to the url translator.

pj_nitin
up_the_irons
imarsh_emusand
alwell-kevin_emusand
nat_contoso
bryanaknight_emusand
alwell-kevin_demo
cjs_emusand
camilogarcialarotta_emusand
sbryant_emusand
jamiecool_emusand
evil-camilo_emusand
rohitnb_fabrikam
RaviGadhia_fabrikam
nielspineda_fabrikam
son7211_fabrikam

b) Renaming non-gh repos to an actual url may be a reasonable approach to deal with url translation for the remaining forges. This can be a task for an upcoming version V.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants