Text Classification Assignment.json
{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"},"colab":{"name":"Text Classification Assignment.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"mDMgSstPYv0P","colab_type":"text"},"source":["# Text Classification:\n","\n","## Data\n","<pre>\n","1. We have 20 types of documents (text files) and 18828 documents in total.\n","2. You can download the data from this <a href='https://drive.google.com/open?id=1rxD15nyeIPIAZ-J2VYPrDRZI66-TBWvM'>link</a>; it contains the documents.rar archive. <br>If you extract it, you will get 18828 documents. Each document name has the form 'ClassLabel_DocumentNumberInThatLabel', \n","so you can extract the label of a document from its name.\n","3. The problem is to classify each document into one of the classes.\n","4. Below is a count plot of all the labels in the data. \n","</pre>"]},{"cell_type":"code","metadata":{"id":"64U9NzWFYv0V","colab_type":"code","outputId":"f3f19ed2-f637-4a8c-cff7-40a603025e96","colab":{}},"source":["### count plot of all the class labels. "],"execution_count":0,"outputs":[{"output_type":"display_data","data":{"text/html":["<img src=\"\" width=\"640\">"],"text/plain":["<IPython.core.display.HTML object>"]},"metadata":{"tags":[]}}]},{"cell_type":"markdown","metadata":{"id":"2mK4TJOFYv0h","colab_type":"text"},"source":["## Assignment:"]},{"cell_type":"markdown","metadata":{"id":"VlqYFVI3Yv0k","colab_type":"text"},"source":["#### sample document\n","<pre>\n","<font color='blue'>\n","Subject: A word of advice\n","From: jcopelan@nyx.cs.du.edu (The One and Only)\n","\n","In article < 65882@mimsy.umd.edu > mangoe@cs.umd.edu (Charley Wingate) writes:\n",">\n",">I've said 100 times that there is no \"alternative\" that should think you\n",">might have caught on by now. And there is no \"alternative\", but the point\n",">is, \"rationality\" isn't an alternative either. The problems of metaphysical\n",">and religious knowledge are unsolvable-- or I should say, humans cannot\n",">solve them.\n","\n","How does that saying go: Those who say it can't be done shouldn't interrupt\n","those who are doing it.\n","\n","Jim\n","--\n","Have you washed your brain today?\n","</font>\n","</pre>"]},{"cell_type":"markdown","metadata":{"id":"KAR5HoR1Yv0m","colab_type":"text"},"source":["### Preprocessing:\n","<pre>\n","useful links: <a href='http://www.pyregex.com/'>http://www.pyregex.com/</a>\n","\n","<font color='blue'><b>1.</b></font> Find all emails in the document, take the text after the \"@\" of each, and split that text on '.'. \n","Then remove the words whose length is less than or equal to 2, also remove the word 'com', and join the remaining words with spaces. \n","If a document has 2 or more emails, process all of them.\n","<b>Eg:[test@dm1.d.com, test2@dm2.dm3.com]-->[dm1.d.com, dm2.dm3.com]-->[dm1,d,com,dm2,dm3,com]-->[dm1,dm2,dm3]-->\"dm1 dm2 dm3\" </b> \n","Append each result to one list/array. (This gives a list of 18828 strings, i.e. one string per document.) \n","Some sample output is shown below. 
\n","\n","> The above sample document contains the emails [jcopelan@nyx.cs.du.edu, 65882@mimsy.umd.edu, mangoe@cs.umd.edu]\n","\n","preprocessing:\n","[jcopelan@nyx.cs.du.edu, 65882@mimsy.umd.edu, mangoe@cs.umd.edu] ==> [nyx cs du edu mimsy umd edu cs umd edu] ==> \n","[nyx edu mimsy umd edu umd edu]\n","\n","<font color='blue'><b>2.</b></font> Replace all the emails in the original text with a space. \n","</pre>"]},{"cell_type":"code","metadata":{"id":"KavKDD9FYv0p","colab_type":"code","outputId":"0b87ab7b-46df-4995-eaca-4f5831ad223e","colab":{}},"source":["# all emails have been collected and preprocessed; this is sample output\n","preprocessed_email"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array(['juliet caltech edu',\n"," 'coding bchs edu newsgate sps mot austlcm sps mot austlcm sps mot com dna bchs edu',\n"," 'batman bmd trw', ..., 'rbdc wsnc org dscomsa desy zeus desy',\n"," 'rbdc wsnc org morrow stanford edu pangea Stanford EDU',\n"," 'rbdc wsnc org apollo apollo'], dtype=object)"]},"metadata":{"tags":[]},"execution_count":28}]},{"cell_type":"code","metadata":{"id":"obReqs55Yv0v","colab_type":"code","outputId":"10770414-9be0-4d63-9587-5363a8c10c4d","colab":{}},"source":["len(preprocessed_email)"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["18828"]},"metadata":{"tags":[]},"execution_count":29}]},{"cell_type":"markdown","metadata":{"id":"zIovFDQzYv03","colab_type":"text"},"source":["<pre>\n","<font color='blue'><b>3.</b></font> Get the subject of the text, i.e. take every line where \"Subject:\" occurs, remove \n","the words that come before the \":\", and remove newlines, tabs, punctuation and any special characters.\n","<b>Eg: for a sentence like \"Subject: Re: Gospel Dating @ \\r\\r\\n\" --> you have to get \"Gospel Dating\"</b> \n","Save all this data into another list/array. 
\n","\n","<font color='blue'><b>4.</b></font> After you store it in the list, replace those sentences in the original text with a space.\n","\n","<font color='blue'><b>5.</b></font> Delete all the sentences that start with <b>\"Write to:\"</b> or <b>\"From:\"</b>.\n","> In the above sample document, check the 2nd line; it should be removed.\n","\n","<font color='blue'><b>6.</b></font> Delete all the tags like \"< anyword >\"\n","> In the above sample document, check the 4th line; we should remove \"< 65882@mimsy.umd.edu >\"\n","\n","\n","<font color='blue'><b>7.</b></font> Delete all the data present in brackets. \n","In many text documents, an explanation of a sentence \n","or its translation into another language is kept in brackets, so remove all of those.\n","<b>Eg: \"AAIC-The course that gets you HIRED(AAIC - Der Kurs, der Sie anstellt)\" --> \"AAIC-The course that gets you HIRED\"</b>\n","\n","> In the above sample document, check the 4th line; we should remove \"(Charley Wingate)\"\n","\n","\n","<font color='blue'><b>8.</b></font> Remove all the newlines('\\n'), tabs('\\t'), \"-\", \"\\\".\n","\n","<font color='blue'><b>9.</b></font> Remove all the words that end with <b>\":\"</b>.\n","<b>Eg: \"Anyword:\"</b>\n","> In the above sample document, check the 4th line; we should remove \"writes:\"\n","\n","\n","<font color='blue'><b>10.</b></font> Decontractions: replace contracted words like the ones below with their full forms. \n","Please check the donors choose preprocessing for this. \n","<b>Eg: can't -> can not, 's -> is, i've -> i have, i'm -> i am, you're -> you are, i'll --> i will </b>\n","\n","<b> Points 6 to 10 can be done in any order, but the final output must be correct.</b>\n","\n","<font color='blue'><b>11.</b></font> Do chunking on the text you have after the above preprocessing. 
\n","Text chunking, also referred to as shallow parsing, is a task that \n","follows Part-Of-Speech Tagging and that adds more structure to the sentence.\n","It groups some phrases and named entities into a single unit.\n","After chunking, join the words of each phrase/named entity with <b>\"_\"</b>. \n","Remove the phrase/named entity if it is a \"Person\". \n","You can use <b>nltk.ne_chunk</b> to get these. \n","Below we have given one example. Please go through it. \n","\n","useful links: \n","<a href='https://www.nltk.org/book/ch07.html'>https://www.nltk.org/book/ch07.html</a>\n","<a href='https://stackoverflow.com/a/31837224/4084039'>https://stackoverflow.com/a/31837224/4084039</a>\n","<a href='http://www.nltk.org/howto/tree.html'>http://www.nltk.org/howto/tree.html</a>\n","<a href='https://stackoverflow.com/a/44294377/4084039'>https://stackoverflow.com/a/44294377/4084039</a>\n","</pre>"]},{"cell_type":"code","metadata":{"id":"2lAaKQ6EYv04","colab_type":"code","outputId":"53b66a94-acef-4002-e51c-002bde4178b4","colab":{}},"source":["#i am living in the New York\n","print(\"i am living in the New York -->\", list(chunks))\n","print(\" \")\n","print(\"-\"*50)\n","print(\" \")\n","#My name is Srikanth Varma\n","print(\"My name is Srikanth Varma -->\", list(chunks1))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["i am living in the New York --> [('i', 'NN'), ('am', 'VBP'), ('living', 'VBG'), ('in', 'IN'), ('the', 'DT'), Tree('GPE', [('New', 'NNP'), ('York', 'NNP')])]\n"," \n","--------------------------------------------------\n"," \n","My name is Srikanth Varma --> [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), Tree('PERSON', [('Srikanth', 'NNP'), ('Varma', 'NNP')])]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"XV8gzLUjYv0-","colab_type":"text"},"source":["<pre>We did chunking for above two lines and then We got one list where each word is mapped to a \n","POS(parts of speech) and also if you see \"New 
York\" and \"Srikanth Varma\", \n","they got combined and represented as a tree and \"New York\" was referred as \"GPE\" and \"Srikanth Varma\" was referred as \"PERSON\". \n","so now you have to Combine the \"New York\" with <b>\"_\"</b> i.e \"New_York\"\n","and remove the \"Srikanth Varma\" from the above sentence because it is a person.</pre>"]},{"cell_type":"markdown","metadata":{"id":"VpaC-KF3Yv1A","colab_type":"text"},"source":["<pre>\n","<font color='blue'><b>13.</b></font> Replace all the digits with space i.e delete all the digits. \n","> In the above sample document, the 6th line has the digits 100, so we have to remove them.\n","\n","<font color='blue'><b>14.</b></font> After doing the above points, we observed there might be a few words like\n"," <b> \"_word_\" (i.e starting and ending with the _), \"_word\" (i.e starting with the _),\n"," \"word_\" (i.e ending with the _)</b> remove the <b>_</b> from these types of words. \n","\n","<font color='blue'><b>15.</b></font> We also observed some words like <b> \"OneLetter_word\"- eg: d_berlin, \n","\"TwoLetters_word\" - eg: dr_berlin </b>, in these words we remove the \"OneLetter_\" (d_berlin ==> berlin) and \n","\"TwoLetters_\" (dr_berlin ==> berlin). i.e. remove the parts \n","whose length is less than or equal to 2 after splitting those words by \"_\". \n","\n","<font color='blue'><b>16.</b></font> Convert all the words to lower case \n","and remove the words whose length is greater than or equal to 15 or less than or equal to 2.\n","\n","<font color='blue'><b>17.</b></font> Replace all the characters other than \"A-Za-z_\" with space. \n","\n","<font color='blue'><b>18.</b></font> Now you have the preprocessed text, email and subject. Create a dataframe with those. \n","Below are the columns of the df. 
\n","</pre>"]},{"cell_type":"code","metadata":{"id":"hB43OGEfYv1C","colab_type":"code","outputId":"945bc8a4-1f99-4410-94c8-c776a405b5f0","colab":{}},"source":["data.columns"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Index(['text', 'class', 'preprocessed_text', 'preprocessed_subject',\n"," 'preprocessed_emails'],\n"," dtype='object')\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"AM6A19xFYv1I","colab_type":"code","outputId":"9de13fa8-6604-49a2-8013-6b22f0a256a8","colab":{}},"source":["data.iloc[400]"],"execution_count":0,"outputs":[{"output_type":"stream","text":["text From: arc1@ukc.ac.uk (Tony Curtis)\\r\\r\\r\\nSubj...\n","class alt.atheism\n","preprocessed_text said re is article if followed the quoting rig...\n","preprocessed_subject christian morality is\n","preprocessed_emails ukc mac macalstr edu\n","Name: 567, dtype: object\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"rfWUeIN1Yv1N","colab_type":"text"},"source":["### To get the above-mentioned dataframe --> try to write all the preprocessing steps in one function named preprocess, as below. "]},{"cell_type":"code","metadata":{"id":"uEGEHTNQYv1N","colab_type":"code","colab":{}},"source":["def preprocess(Input_Text):\n"," \"\"\"Do all the Preprocessing as shown above and\n"," return a tuple containing preprocess_email, preprocess_subject, preprocess_text for that Input_Text\"\"\"\n"," return (list_of_preprocessed_emails,subject,text)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ceASjKizYv1U","colab_type":"text"},"source":["### Code checking:\n","\n","<font color='red' size=4>\n","After Writing preprocess function. 
call that function with the input text of 'alt.atheism_49960' doc and print the output of the preprocess function\n","<br>\n","This will help us evaluate faster; based on the output we can suggest any changes that are needed.\n","</font>"]},{"cell_type":"markdown","metadata":{"id":"2x3og_iaYv1S","colab_type":"text"},"source":["### After writing the preprocess function, call the function for each of the documents (18828 docs) and then create a dataframe as mentioned above."]},{"cell_type":"markdown","metadata":{"id":"n3ucJLtWYv1V","colab_type":"text"},"source":["### Training The models to Classify: \n","\n","<pre>\n","1. Combine \"preprocessed_text\", \"preprocessed_subject\", \"preprocessed_emails\" into one column. Use that column to model. \n","\n","2. Now split the data into train and test. Use 25% for test and do a stratified split. \n","\n","3. Analyze your text data and pad the sequence if required. \n","Sequence length is not restricted, you can use anything of your choice. \n","You need to give the reasoning.\n","\n","4. Tokenize, i.e. convert text into numbers. Please be careful while doing it: \n","if you are using the tf.keras \"Tokenizer\" API, its default filters remove the <b>\"_\"</b>, but we need that.\n","\n","5. Code the models ( Model-1, Model-2 ) as discussed below \n","and try to optimize those models. \n","\n","6. For every model use predefined Glove vectors. \n","<b>Don't train any word vectors while Training the model.</b>\n","\n","7. Use \"categorical_crossentropy\" as Loss. \n","\n","8. Use <b>Accuracy and Micro Averaged F1 score</b> as your key metrics to evaluate your model. \n","\n","9. Use Tensorboard to plot the loss and metrics against the epochs.\n","\n","10. Please save your best model weights into <b>'best_model_L.h5' ( L = 1 or 2 )</b>. \n","\n","11. You are free to choose any activation function, learning rate, optimizer.\n","But you have to use the same architecture which we are giving below.\n","\n","12. 
You can add layers to our architecture, but <b>deletion</b> of any layer is not acceptable.\n","\n","13. Try to use <b>Early Stopping</b> technique or any of the callback techniques that you did in the previous assignments.\n","\n","14. For every model, save your model to an image ( plot the model ) with shapes \n","and include those images in the notebook markdown cell, \n","upload those images to Classroom. You can use \"plot_model\" \n","please refer <a href='https://www.tensorflow.org/api_docs/python/tf/keras/utils/plot_model'>this</a> if you don't know how to plot the model with shapes. \n","\n","</pre>"]},{"cell_type":"markdown","metadata":{"id":"c0mwdtcvYv1X","colab_type":"text"},"source":["### Model-1: Using 1D convolutions with word embeddings"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"gXPPsovJ3ePk"},"source":["<pre>\n","<b>Encoding of the Text </b> --> For a given text data create a Matrix with Embedding layer as shown Below. \n","In the example we have considered d = 5, but in this assignment we will get d = dimension of Word vectors we are using.\n"," i.e. if we have a maximum of 350 words in a sentence and a 300-dim word vector embedding, \n"," we get a 350*300 dimensional matrix for each sentence as output after the embedding layer\n","<img src='https://i.imgur.com/kiVQuk1.png'>\n","Ref: https://i.imgur.com/kiVQuk1.png\n","\n","<b>Reference:</b>\n","<a href='https://stackoverflow.com/a/43399308/4084039'>https://stackoverflow.com/a/43399308/4084039</a>\n","<a href='https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/'>https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/</a>\n","\n","<b><a href='https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work'>How EMBEDDING LAYER WORKS </a></b>\n","\n","</pre>\n","\n","### Go through this blog, if you have any doubt on using predefined Embedding values in Embedding layer - 
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/"]},{"cell_type":"markdown","metadata":{"id":"wGVQKge3Yv1e","colab_type":"text"},"source":["<img src='https://i.imgur.com/fv1GvFJ.png'>\n","ref: 'https://i.imgur.com/fv1GvFJ.png'"]},{"cell_type":"markdown","metadata":{"id":"GC6SBG5AYv1f","colab_type":"text"},"source":["<pre>\n","1. All are Conv1D layers with any number of filters and any filter sizes, there is no restriction on this.\n","\n","2. Use a concatenate layer to concatenate all the filters/channels. \n","\n","3. You can use any pool size and stride for maxpooling layer.\n","\n","4. Don't use more than 16 filters in one Conv layer because it will increase the no of params. \n","( Only a recommendation if you have less computing power )\n","\n","5. You can use any number of layers after the Flatten Layer.\n","</pre>"]},{"cell_type":"markdown","metadata":{"id":"9cg4L1V4Yv1d","colab_type":"text"},"source":["### Model-2 : Using 1D convolutions with character embedding"]},{"cell_type":"markdown","metadata":{"id":"2Djg4YVA3oQx","colab_type":"text"},"source":["<pre>\n","<pre><img src=\"https://i.ytimg.com/vi/CNY8VjJt-iQ/maxresdefault.jpg\" width=\"70%\">\n","Here are some papers based on Char-CNN\n"," 1. Xiang Zhang, Junbo Zhao, Yann LeCun. <a href=\"http://arxiv.org/abs/1509.01626\">Character-level Convolutional Networks for Text Classification</a>.NIPS 2015\n"," 2. Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush. <a href=\"https://arxiv.org/abs/1508.06615\">Character-Aware Neural Language Models</a>. AAAI 2016\n"," 3. Shaojie Bai, J. Zico Kolter, Vladlen Koltun. <a href=\"https://arxiv.org/pdf/1803.01271.pdf\">An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling</a>\n"," 4. 
Use the pretrained char embeddings <a href='https://github.com/minimaxir/char-embeddings/blob/master/glove.840B.300d-char.txt'>https://github.com/minimaxir/char-embeddings/blob/master/glove.840B.300d-char.txt</a>\n","</pre>"]},{"cell_type":"markdown","metadata":{"id":"VXvKSEIeSvN5","colab_type":"text"},"source":["<img src='https://i.imgur.com/EuuoJtr.png'>"]}]}