YouTube Caption API Python example converted to Python 3
A successful attempt at extracting closed caption information from Siraj Raval's ML videos. The ultimate goal of this extraction is to make a Sirajbot so that we can all have our own personal Siraj.
The original (Python 2) sample code is found under the download call in this API's documentation:
https://developers.google.com/youtube/v3/docs/captions
To run this code, you need to get OAuth2 credentials from Google:
https://console.developers.google.com/apis/credentials
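The converted script keeps the OAuth2 flow from the original sample. Below is a minimal sketch of that setup, assuming the google-api-python-client and oauth2client packages; the file names (client_secrets.json, youtube-caption-oauth2.json) are illustrative and may differ from what the script actually uses:

```python
# Minimal sketch of the OAuth2 setup (assumed, not copied from this repo):
# load client secrets downloaded from the Google API Console, cache the
# resulting credentials, and build an authorized YouTube Data API v3 client.
import httplib2
from apiclient.discovery import build
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import run_flow

YOUTUBE_READ_WRITE_SSL_SCOPE = "https://www.googleapis.com/auth/youtube.force-ssl"

flow = flow_from_clientsecrets("client_secrets.json",
                               scope=YOUTUBE_READ_WRITE_SSL_SCOPE)
storage = Storage("youtube-caption-oauth2.json")  # cached token file (illustrative name)
credentials = storage.get()
if credentials is None or credentials.invalid:
    credentials = run_flow(flow, storage)

youtube = build("youtube", "v3", http=credentials.authorize(httplib2.Http()))
```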
Usage examples:
These examples extract the closed caption information from this video on creating a chatbot:
https://www.youtube.com/watch?v=t5qgjJIBy9g
python youtube-caption.py --action=list --videoid=t5qgjJIBy9g

Caption track '(CTcNv09WGJMRU2JMppJBW2SFXPfsrJtllhu4z_DJ_fQ=)' in 'en' language.
Created and managed caption tracks.
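For reference, the list action is essentially a thin wrapper around the captions.list endpoint. A rough sketch, following the pattern of the sample this script was converted from (the function name is illustrative):

```python
# Rough outline of what --action=list does: query captions.list for the video
# and print each track's name, id, and language (matches the output shown above).
def list_captions(youtube, video_id):
    results = youtube.captions().list(part="snippet", videoId=video_id).execute()
    for item in results["items"]:
        snippet = item["snippet"]
        print("Caption track '%s(%s)' in '%s' language." %
              (snippet["name"], item["id"], snippet["language"]))
    return results["items"]
```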
python youtube-caption.py --action=download --videoid=t5qgjJIBy9g --captionid=CTcNv09WGJMRU2JMppJBW2SFXPfsrJtllhu4z_DJ_fQ= First line of caption track: b"1\n00:00:00,000 --> 00:00:04,529\nhello world its Suraj and let's build a\n\n2\n00:00:02,429 --> 00:00:07,140\nchat bot that can answer questions about\n\n3\n00:00:04,529 --> 00:00:09,540\nany text you give it it an article or\n\n4\n00:00:07,140 --> 00:00:11,790\neven a book using care off just imagine\n\n5\n00:00:09,540 --> 00:00:13,920\nthe boost in productivity all of us will\n\n6\n00:00:11,790 --> 00:00:16,350\nhave once we have access to expert\n\n7\n00:00:13,920 --> 00:00:18,029\nsystems for any given topic instead of\n\n8\n00:00:16,350 --> 00:00:20,250\nsifting through all the jargon in a\n\n9\n00:00:18,029 --> 00:00:22,230\nscientific paper you just give it the\n\n10\n00:00:20,250 --> 00:00:25,230\npaper then ask it the relevant questions\n\n11\n00:00:22,230 --> 00:00:27,779\nentire textbooks libraries videos images\n\n12\n00:00:25,230 --> 00:00:30,240\nwhatever you just feed it some data and\n\n13\n00:00:27,779 --> 00:00:32,279\nit would become an expert at it all 7\n\n14\n00:00:30,240 --> 00:00:34,530\nbillion people on earth would have the\n\n15\n00:00:32,279 --> 00:00:36,360\ncapability of learning anything much\n\n16\n00:00:34,530 --> 00:00:39,270\nfaster the web democratize information\n\n17\n00:00:36,360 --> 00:00:41,910\nand this next evolution will democratize\n\n18\n00:00:39,270 --> 00:00:43,530\nsomething just as important guidance the\n\n19\n00:00:41,910 --> 00:00:46,170\nideal chat box and talk intelligently\n\n20\n00:00:43,530 --> 00:00:48,750\nabout any domain that's the Holy Grail\n\n21\n00:00:46,170 --> 00:00:51,090\nbut domain-specific chat bots are\n\n22\n00:00:48,750 --> 00:00:53,789\ndefinitely possible the technical term\n\n23\n00:00:51,090 --> 00:00:55,800\nfor this is a question answering system\n\n24\n00:00:53,789 --> 00:00:58,739\nsurprisingly we've been able to do this\n\n25\n00:00:55,800 --> 00:01:00,390\nsince way back in the 70s lunar was one\n\n26\n00:00:58,739 --> 00:01:02,489\nof the first it was as you might have\n\n27\n00:01:00,390 --> 00:01:04,860\nguessed rule-based so it allowed\n\n28\n00:01:02,489 --> 00:01:07,439\ngeologists to ask questions about moon\n\n29\n00:01:04,860 --> 00:01:09,270\nrocks from the Apollo missions a later\n\n30\n00:01:07,439 --> 00:01:11,729\nimprovement to rule-based Q&A systems\n\n31\n00:01:09,270 --> 00:01:13,680\nallowing programmers to encode patterns\n\n32\n00:01:11,729 --> 00:01:16,680\ninto their BOTS called artificial\n\n33\n00:01:13,680 --> 00:01:18,840\nintelligence markup language or a IML\n\n34\n00:01:16,680 --> 00:01:21,570\nthat meant less code for the same\n\n35\n00:01:18,840 --> 00:01:24,180\nresults but yeah don't use a IML it's so\n\n36\n00:01:21,570 --> 00:01:25,619\nold it makes numa numa look new now with\n\n37\n00:01:24,180 --> 00:01:27,869\ndeep learning we can do this without\n\n38\n00:01:25,619 --> 00:01:30,360\nhard coded responses and have much\n\n39\n00:01:27,869 --> 00:01:33,030\nbetter results the generic case is that\n\n40\n00:01:30,360 --> 00:01:35,250\nyou give it some tax as input and then\n\n41\n00:01:33,030 --> 00:01:37,259\nask it a question it'll give you the\n\n42\n00:01:35,250 --> 00:01:39,420\nright answer after logically reasoning\n\n43\n00:01:37,259 --> 00:01:42,090\nabout it the input could also be that\n\n44\n00:01:39,420 --> 00:01:44,100\neverybody is happy and then the question\n\n45\n00:01:42,090 --> 00:01:46,049\ncould 
be what's the sentiment the answer\n\n46\n00:01:44,100 --> 00:01:48,630\nwould be positive other possible\n\n47\n00:01:46,049 --> 00:01:50,670\nquestions are what's the entity what are\n\n48\n00:01:48,630 --> 00:01:53,549\nthe part of speech tags what's the\n\n49\n00:01:50,670 --> 00:01:55,979\ntranslation to French we need a common\n\n50\n00:01:53,549 --> 00:01:57,750\nmodel for all of these questions this is\n\n51\n00:01:55,979 --> 00:02:00,030\nwhat the AI community is trying to\n\n52\n00:01:57,750 --> 00:02:01,920\nfigure out how to do facebook research\n\n53\n00:02:00,030 --> 00:02:04,110\nmade some great progress with this just\n\n54\n00:02:01,920 --> 00:02:06,509\ntwo years ago when they released a paper\n\n55\n00:02:04,110 --> 00:02:09,780\nintroducing this really cool idea called\n\n56\n00:02:06,509 --> 00:02:12,599\na memory network lstm networks proved to\n\n57\n00:02:09,780 --> 00:02:13,800\nbe a useful tool in tasks like text\n\n58\n00:02:12,599 --> 00:02:16,050\nsummarization but\n\n59\n00:02:13,800 --> 00:02:19,110\ntheir memory encoded by hidden states\n\n60\n00:02:16,050 --> 00:02:22,200\nand weight is too small for very very\n\n61\n00:02:19,110 --> 00:02:25,410\nlong sequences of data be that a book or\n\n62\n00:02:22,200 --> 00:02:27,300\na movie a way around this for language\n\n63\n00:02:25,410 --> 00:02:29,880\ntranslation for example was to store\n\n64\n00:02:27,300 --> 00:02:31,860\nmultiple lstm states and use an\n\n65\n00:02:29,880 --> 00:02:34,110\nattention mechanism to choose between\n\n66\n00:02:31,860 --> 00:02:37,470\nthem but they develop another strategy\n\n67\n00:02:34,110 --> 00:02:40,080\nthat outperformed lft ms or QA systems\n\n68\n00:02:37,470 --> 00:02:42,540\nthe idea was to allow a neural network\n\n69\n00:02:40,080 --> 00:02:45,210\nto use an external data structure as\n\n70\n00:02:42,540 --> 00:02:47,370\nmemory storage it learns where to\n\n71\n00:02:45,210 --> 00:02:49,830\nretrieve the required memory from the\n\n72\n00:02:47,370 --> 00:02:51,750\nmemory bank in a supervised way when it\n\n73\n00:02:49,830 --> 00:02:54,060\ncame to entering questions from COI data\n\n74\n00:02:51,750 --> 00:02:55,950\nthat was generated that info was pretty\n\n75\n00:02:54,060 --> 00:02:59,100\neasy to come by but in real world data\n\n76\n00:02:55,950 --> 00:03:01,320\nit is not that easy most recently there\n\n77\n00:02:59,100 --> 00:03:03,660\nwas a four-month-long cattle contest\n\n78\n00:03:01,320 --> 00:03:06,570\nthat a startup called meta mind placed\n\n79\n00:03:03,660 --> 00:03:08,730\nin the top 5% for to do this they built\n\n80\n00:03:06,570 --> 00:03:11,520\na new state-of-the-art model called a\n\n81\n00:03:08,730 --> 00:03:14,130\ndynamic memory network that built on\n\n82\n00:03:11,520 --> 00:03:15,720\nFacebook's initial idea that's the one\n\n83\n00:03:14,130 --> 00:03:18,030\nwe'll focus on so let's build it\n\n84\n00:03:15,720 --> 00:03:20,010\nprogrammatically using care of this data\n\n85\n00:03:18,030 --> 00:03:22,410\nset is pretty well organized it was\n\n86\n00:03:20,010 --> 00:03:24,420\ncreated by Facebook AI research for the\n\n87\n00:03:22,410 --> 00:03:26,430\nspecific goal of improving textual\n\n88\n00:03:24,420 --> 00:03:29,489\nreasoning it's grouped into 20 different\n\n89\n00:03:26,430 --> 00:03:31,860\ntasks each task tests a different aspect\n\n90\n00:03:29,489 --> 00:03:33,720\nof reasoning so overall it provides a\n\n91\n00:03:31,860 --> 00:03:35,700\ngood overview of all the different\n\n92\n00:03:33,720 --> 
00:03:37,500\ncapabilities of your learning model\n\n93\n00:03:35,700 --> 00:03:39,420\nthere are a thousand questions for\n\n94\n00:03:37,500 --> 00:03:41,700\ntraining at a thousand for testing per\n\n95\n00:03:39,420 --> 00:03:43,830\ntask each question is paired with a\n\n96\n00:03:41,700 --> 00:03:46,050\nstatement or series of statements as\n\n97\n00:03:43,830 --> 00:03:48,390\nwell as an answer the goal is to have\n\n98\n00:03:46,050 --> 00:03:50,940\none model that can succeed in all tasks\n\n99\n00:03:48,390 --> 00:03:52,980\neasily will use pre-trained glove\n\n100\n00:03:50,940 --> 00:03:55,200\nvectors to help create a sequence of war\n\n101\n00:03:52,980 --> 00:03:57,390\nvectors from our input sentences and\n\n102\n00:03:55,200 --> 00:03:59,970\nthese vectors will act as inputs to the\n\n103\n00:03:57,390 --> 00:04:02,070\nmodel the dmn architecture defines two\n\n104\n00:03:59,970 --> 00:04:04,680\ntypes of memory semantic and episodic\n\n105\n00:04:02,070 --> 00:04:07,320\nthese input vectors are considered the\n\n106\n00:04:04,680 --> 00:04:08,730\nsemantic memory whereas episodic memory\n\n107\n00:04:07,320 --> 00:04:11,130\nmight contain other knowledge as well\n\n108\n00:04:08,730 --> 00:04:12,810\nand we'll talk about that in a second we\n\n109\n00:04:11,130 --> 00:04:14,880\ncan fetch our babble data set from the\n\n110\n00:04:12,810 --> 00:04:16,769\nweb and split them into training and\n\n111\n00:04:14,880 --> 00:04:18,630\ntesting data the love will help convert\n\n112\n00:04:16,769 --> 00:04:20,760\nour words two vectors so they're ready\n\n113\n00:04:18,630 --> 00:04:23,280\nto be fed into our model the first\n\n114\n00:04:20,760 --> 00:04:25,229\nmodule the input module is a GRU or\n\n115\n00:04:23,280 --> 00:04:26,590\ngated recurrent unit that runs on a\n\n116\n00:04:25,229 --> 00:04:28,990\nsequence of words\n\n117\n00:04:26,590 --> 00:04:31,750\nvectors a GRU cell is kind of like an\n\n118\n00:04:28,990 --> 00:04:33,850\nlstm cell but it's more computationally\n\n119\n00:04:31,750 --> 00:04:36,520\nefficient since it only has two gates\n\n120\n00:04:33,850 --> 00:04:38,169\nand it doesn't use a memory unit the two\n\n121\n00:04:36,520 --> 00:04:40,600\ngates control when its content is\n\n122\n00:04:38,169 --> 00:04:45,190\nupdated and when it's erased off a\n\n123\n00:04:40,600 --> 00:04:49,479\nrecess up the resistance of a recession\n\n124\n00:04:45,190 --> 00:04:52,660\nand the hidden state of the input module\n\n125\n00:04:49,479 --> 00:04:54,910\nrepresents the input process so far in a\n\n126\n00:04:52,660 --> 00:04:57,100\nvector it outputs hidden States after\n\n127\n00:04:54,910 --> 00:04:59,169\nevery sentence and these outputs are\n\n128\n00:04:57,100 --> 00:05:00,910\ncalled facts and the paper because they\n\n129\n00:04:59,169 --> 00:05:02,620\nrepresent the essence of what is fed\n\n130\n00:05:00,910 --> 00:05:04,570\ngiven a word vector and the previous\n\n131\n00:05:02,620 --> 00:05:06,820\ntime step detector will compute the\n\n132\n00:05:04,570 --> 00:05:08,889\ncurrent time step vector the uplinking\n\n133\n00:05:06,820 --> 00:05:11,020\nis a single layer neural network we sum\n\n134\n00:05:08,889 --> 00:05:13,600\nup the matrix multiplications and add a\n\n135\n00:05:11,020 --> 00:05:15,430\nbias term and then the signal it\n\n136\n00:05:13,600 --> 00:05:18,370\nsquashes it to a list of values between\n\n137\n00:05:15,430 --> 00:05:20,560\n0 and 1 the output vector we do this\n\n138\n00:05:18,370 --> 00:05:22,900\ntwice with different 
sets of weights\n\n139\n00:05:20,560 --> 00:05:24,789\nthen we use a reset gate that will learn\n\n140\n00:05:22,900 --> 00:05:26,889\nto ignore the past time steps when\n\n141\n00:05:24,789 --> 00:05:29,020\nnecessary for example if the next\n\n142\n00:05:26,889 --> 00:05:31,450\nsentence has nothing to do with those\n\n143\n00:05:29,020 --> 00:05:32,830\nthat came before it the update gate is\n\n144\n00:05:31,450 --> 00:05:35,830\nsimilar in that it can learn to ignore\n\n145\n00:05:32,830 --> 00:05:37,510\nthe current time step entirely maybe the\n\n146\n00:05:35,830 --> 00:05:40,539\ncurrent sentence has nothing to do with\n\n147\n00:05:37,510 --> 00:05:42,849\nthe answer whereas previous one bit then\n\n148\n00:05:40,539 --> 00:05:45,820\nthere's the question module it processes\n\n149\n00:05:42,849 --> 00:05:48,310\nthe question word by word and outputs a\n\n150\n00:05:45,820 --> 00:05:50,979\nvector by using the same gru as the\n\n151\n00:05:48,310 --> 00:05:52,479\ninput module and the same weight we can\n\n152\n00:05:50,979 --> 00:05:54,849\nencode both of them by creating\n\n153\n00:05:52,479 --> 00:05:56,889\nembedding layers for both then we'll\n\n154\n00:05:54,849 --> 00:05:59,200\ncreate an episodic memory representation\n\n155\n00:05:56,889 --> 00:06:01,120\nfor both the motivation for this in the\n\n156\n00:05:59,200 --> 00:06:03,340\npaper came from the hippocampus function\n\n157\n00:06:01,120 --> 00:06:05,349\nin our brain it's able to retrieve\n\n158\n00:06:03,340 --> 00:06:08,260\ntemporal states that are triggered by\n\n159\n00:06:05,349 --> 00:06:10,660\nsome response like a site or a sound\n\n160\n00:06:08,260 --> 00:06:12,880\nboth the fact and question vectors that\n\n161\n00:06:10,660 --> 00:06:15,190\nare extracted from the input enter the\n\n162\n00:06:12,880 --> 00:06:17,590\nepisodic memory module it's composed of\n\n163\n00:06:15,190 --> 00:06:19,450\ntwo nested gr use the energy ru\n\n164\n00:06:17,590 --> 00:06:21,880\ngenerates what are called episodes it\n\n165\n00:06:19,450 --> 00:06:24,130\ndoesn't by passing over the facts from\n\n166\n00:06:21,880 --> 00:06:26,050\nthe input module but when updating its\n\n167\n00:06:24,130 --> 00:06:28,330\ninterstate it takes into account the\n\n168\n00:06:26,050 --> 00:06:30,130\noutput of an attention function on the\n\n169\n00:06:28,330 --> 00:06:32,289\ncurrent fact the attention function\n\n170\n00:06:30,130 --> 00:06:35,320\ngives a score between zero and one to\n\n171\n00:06:32,289 --> 00:06:38,050\neach fact and so the GRU ignores facts\n\n172\n00:06:35,320 --> 00:06:39,490\nwith low scores after each full pass on\n\n173\n00:06:38,050 --> 00:06:41,710\nall the facts the in\n\n174\n00:06:39,490 --> 00:06:43,750\ngru outputs an episode which is then fed\n\n175\n00:06:41,710 --> 00:06:46,120\nto the outer GRU the reason we need\n\n176\n00:06:43,750 --> 00:06:48,160\nmultiple episodes is so our model can\n\n177\n00:06:46,120 --> 00:06:50,349\nlearn what part of a sentence it should\n\n178\n00:06:48,160 --> 00:06:52,539\npay attention to after realizing after\n\n179\n00:06:50,349 --> 00:06:54,819\none pass that something else is\n\n180\n00:06:52,539 --> 00:06:56,979\nimportant with multiple passes we can\n\n181\n00:06:54,819 --> 00:06:59,710\ngather increasingly relevant information\n\n182\n00:06:56,979 --> 00:07:02,020\nwe can initialize our model and set its\n\n183\n00:06:59,710 --> 00:07:04,090\nloss function has categorical cross\n\n184\n00:07:02,020 --> 00:07:07,240\nentropy with the stochastic 
gradient\n\n185\n00:07:04,090 --> 00:07:09,069\ndescent implementation or MS prop then\n\n186\n00:07:07,240 --> 00:07:10,900\ntrain it on the given data using the fed\n\n187\n00:07:09,069 --> 00:07:12,669\nfunction we can test this code in the\n\n188\n00:07:10,900 --> 00:07:14,889\nbrowser without waiting for it to train\n\n189\n00:07:12,669 --> 00:07:16,930\nbecause luckily for us this researcher\n\n190\n00:07:14,889 --> 00:07:19,150\nuploaded a web app with a fully trained\n\n191\n00:07:16,930 --> 00:07:21,220\nmodel of this code we can generate a\n\n192\n00:07:19,150 --> 00:07:23,349\nstory which is a collection of sentences\n\n193\n00:07:21,220 --> 00:07:25,210\neach describing an event in sequential\n\n194\n00:07:23,349 --> 00:07:27,909\norder then we'll ask it a question\n\n195\n00:07:25,210 --> 00:07:29,590\npretty high accuracy response let's\n\n196\n00:07:27,909 --> 00:07:32,409\ngenerate another story and ask it\n\n197\n00:07:29,590 --> 00:07:34,060\nanother question hero status let's go\n\n198\n00:07:32,409 --> 00:07:36,490\nover the three key facts we've learned\n\n199\n00:07:34,060 --> 00:07:39,159\ngr use control the flow of data like\n\n200\n00:07:36,490 --> 00:07:41,530\nlstm cells but are more computationally\n\n201\n00:07:39,159 --> 00:07:43,990\nefficient using just two gates update\n\n202\n00:07:41,530 --> 00:07:46,180\nand reset dynamic memory networks offer\n\n203\n00:07:43,990 --> 00:07:48,490\nstate-of-the-art performance in question\n\n204\n00:07:46,180 --> 00:07:50,770\nentering systems and they do this by\n\n205\n00:07:48,490 --> 00:07:53,440\nusing both semantic and episodic memory\n\n206\n00:07:50,770 --> 00:07:57,039\ninspired by the hippocampus drumroll\n\n207\n00:07:53,440 --> 00:07:58,270\nplease no never mind nemanja tomek is\n\n208\n00:07:57,039 --> 00:08:00,639\nthe coding challenge winner from last\n\n209\n00:07:58,270 --> 00:08:02,530\nweek he implemented his own neural\n\n210\n00:08:00,639 --> 00:08:04,180\nmachine translator by training it on\n\n211\n00:08:02,530 --> 00:08:06,460\nmovie subtitles in both English and\n\n212\n00:08:04,180 --> 00:08:08,830\nGerman you can see all the results in\n\n213\n00:08:06,460 --> 00:08:10,930\nhis eye Python notebook amazing work\n\n214\n00:08:08,830 --> 00:08:12,550\nwizard of the week and the runner-up is\n\n215\n00:08:10,930 --> 00:08:14,680\nvishal bought two despite the massive\n\n216\n00:08:12,550 --> 00:08:16,719\namount of training time n empty requires\n\n217\n00:08:14,680 --> 00:08:19,599\nmichelle was able to achieve some great\n\n218\n00:08:16,719 --> 00:08:21,699\nresults I vow to both of you this week's\n\n219\n00:08:19,599 --> 00:08:23,740\nchallenge is to make your own Q&A chat\n\n220\n00:08:21,699 --> 00:08:25,360\nbot all the details are in the readme\n\n221\n00:08:23,740 --> 00:08:27,219\ngithub links go in the comments and\n\n222\n00:08:25,360 --> 00:08:28,599\nannounce winner a week from today please\n\n223\n00:08:27,219 --> 00:08:30,610\nsubscribe for more programming videos\n\n224\n00:08:28,599 --> 00:08:32,740\ncheck out this related video and for now\n\n225\n00:08:30,610 --> 00:08:35,849\nI've got to ask the right questions so\n\n226\n00:08:32,740 --> 00:08:35,849\nthanks for watching\n\n" Created and managed caption tracks.
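The download action returns the track as raw SRT bytes, as shown above. Since the stated goal is to turn these captions into Sirajbot training data, the sketch below shows the kind of call behind --action=download plus a small helper to strip cue numbers and timestamps down to plain text; the helper is not part of the original sample, and the function names are illustrative:

```python
import re

def download_caption(youtube, caption_id):
    # tfmt="srt" requests SubRip format; the API returns the whole track as bytes
    return youtube.captions().download(id=caption_id, tfmt="srt").execute()

# Illustrative helper: convert the raw SRT bytes into a single plain-text string
# suitable for downstream chatbot training.
def srt_to_text(srt_bytes):
    timestamp = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")
    kept = []
    for line in srt_bytes.decode("utf-8").splitlines():
        line = line.strip()
        if not line or line.isdigit() or timestamp.match(line):
            continue  # drop blank separators, cue numbers, and timing lines
        kept.append(line)
    return " ".join(kept)

# e.g. srt_to_text(caption_bytes) -> "hello world its Suraj and let's build a chat bot ..."
```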