adding extra utility and parsing abilities to the library #28

Open
binaryaaron opened this issue Oct 23, 2017 · 3 comments
Comments

@binaryaaron
Contributor

I wanted to open this up for discussion.

The tweet parser as is has a ton of great functionality for working with tweets and makes a lot of tweet data readily accessible. However, there are other potential cases of working with tweet data that are not explicitly parsing predictable data from tweets.

For one example, if a person wants to grab tweets with ANY geolocation data and use it, they have to deal with both exact position (lat, long) and geojson place coordinates (bounding box and associated info). I had made some code to essentially unify the two methods to get data that was easily plottable, by creating a function that gets either precise position if it exists or a coordinate in the bounding box.

The code looks like this:

from tweet_parser.tweet_checking import is_original_format

try:
    import numpy as np
    mean_bbox = lambda x: list(np.array(x).mean(axis=0))
except ImportError:
    # pure-Python fallback: average each coordinate column separately
    mean_bbox = lambda x: [sum(col) / len(x) for col in zip(*x)]

def get_profile_geo_coords(tweet):
    # profile_location may be missing, or present without a "geo" entry
    geo = (tweet.profile_location or {}).get("geo") or {}
    coords = geo.get("coordinates") # in [LONG, LAT]
    if not coords:
        return None
    long, lat = coords
    return lat, long


def get_place_coords(tweet, est_center=False):
    """
    Places are formal spots that define a bounding box around a place.
    The bounding box is a list of [long, lat] corner pairs (GeoJSON order),
    e.g. [[long, lat], [long, lat], ...].
    
    """
    
    def get_bbox_ogformat():
        _place = tweet.get("place")
        if _place is None:
            return None
    
        return (_place
                .get("bounding_box")
                .get("coordinates")[0])

    def get_bbox_asformat():
        _place = tweet.get("location")
        if _place is None:
            return None
        return (_place
                .get("geo")
                .get("coordinates")[0])
        
    bbox = get_bbox_ogformat() if is_original_format(tweet) else get_bbox_asformat()
    if bbox is None:
        # no place attached to this tweet, so there is nothing to average
        return None

    return mean_bbox(bbox) if est_center else bbox


def get_exact_geo_coords(tweet):
    geo = tweet.get("geo")
    if geo is None:
        return None
    
    # coordinates.coordinates is [LONG, LAT]
    # geo.coordinates is [LAT, LONG]; the "geo" key is the same in both formats
    return geo.get("coordinates")


def get_a_geo_coordinate(tweet):
    """Returns a (lat, long) tuple that corresponds to a point within the bounding box of this tweet
    or the precise geolocation if it exists. Returns None if the tweet has neither.
    """
    geo = get_exact_geo_coords(tweet)
    if geo:
        # unpack explicitly so that a latitude of 0.0 is not mistaken for "missing"
        lat, long = geo
        return lat, long
    bbox = get_place_coords(tweet)
    if bbox is None:
        return None
    long, lat = mean_bbox(bbox)
    return lat, long

Should we have an auxiliary module in here that allows for storing such code? I think it could be useful long-term in centralizing our efforts, sharing code, and helping end users get work done quickly. I am not at all opposed to putting this type of code elsewhere either.

@jrmontag
Collaborator

This is a good question and line of thought. At the highest level I think all of these things you've shared have a reasonable place in this library. They're all about parsing and accessing the data (relevant), but not really conducting analysis on that data (less relevant).

Others can jump in if this feels off, but it has felt to me that the getters defined in the library so far (as attributes on the Tweet object) are all written in terms of explicit payload content. So, for example, I believe get_exact_geo_coords() exists in the tweet_geo.py module, but there is no similar getter for the bounding region of a Twitter Place. Some form of an explicitly prioritized get_any_geo() that cascades through the possibilities could be useful too.
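To make the cascade idea concrete, here is a minimal sketch of what a prioritized get_any_geo() could look like. It is not existing library code: the function name comes from this discussion, and the sketch assumes an original-format payload passed in as a plain dict rather than a Tweet object.

```python
def get_any_geo(tweet):
    """Hypothetical cascade: exact point first, then place bounding-box center.

    Assumes an original-format payload given as a plain dict. Returns
    (lat, long), or None when the tweet carries no geo data at all.
    """
    # 1. Exact position: geo.coordinates is [lat, long].
    geo = tweet.get("geo") or {}
    point = geo.get("coordinates")
    if point:
        lat, long = point
        return lat, long

    # 2. Place: bounding_box.coordinates[0] is a list of [long, lat] corners.
    place = tweet.get("place") or {}
    bbox = (place.get("bounding_box") or {}).get("coordinates")
    if bbox:
        corners = bbox[0]
        long = sum(c[0] for c in corners) / len(corners)
        lat = sum(c[1] for c in corners) / len(corners)
        return lat, long

    # 3. Nothing usable.
    return None
```

The order of the steps is the priority: a tweet with both an exact point and a place returns the exact point.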

I also think it makes a lot of sense to provide some lightweight convenience methods that do some of the geo manipulations with e.g. bounding regions.

My initial feeling is this: add getters to the Tweet object for all of these explicit geo payload elements (i.e. bounding box and point coordinates) while aligning their signatures, and introduce a new namespace for utilities (as an example: tweet_parser.utils.geo) that we try to limit to highly-relevant things a user might sensibly want to do with the outputs of the Tweet getters, e.g. finding the center of a bounding region. We could work to define where that interface should be (between in- and out-of-scope for this library) in an ongoing manner.
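As an illustration of the kind of small helper that could live in such a utilities namespace, here is a sketch of a bounding-region center function. The name bbox_center is hypothetical, and the sketch assumes the input is a list of [long, lat] corner pairs, as in a Place bounding box.

```python
def bbox_center(corners):
    """Return the (long, lat) midpoint of a list of [long, lat] corner pairs.

    Hypothetical helper. Uses the midpoint of the extreme values, so it
    does not handle boxes that cross the antimeridian.
    """
    longs = [c[0] for c in corners]
    lats = [c[1] for c in corners]
    return ((min(longs) + max(longs)) / 2, (min(lats) + max(lats)) / 2)
```

A helper like this has no dependency on the Tweet object, which is what would keep it on the utility side of the interface: it only consumes what a bounding-box getter returns.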

@fionapigott
Contributor

I think that adding a getter for some other piece of data (like the bounding box of a place) is definitely a getter, but get_any_geo() might be a getter or a convenience function. I think it's good to not put way too much stuff in the Tweet obj, since it's going to be necessarily huge anyway.

I think that we should introduce some kind of utils namespace to store utility functions. There are already one or two (remove_links comes to mind, which is something I made a long time ago and didn't want to delete) that should probably be moved to a utils namespace.

A good step might be for @binaryaaron to add those functions as a new namespace and submit them for review? Or, @binaryaaron, I'd be happy to do it if you want me to.

@binaryaaron
Contributor Author

I'd be happy to make a PR for this and will do so in the near-ish future. :)
