【443】Tweets Analysis Q&A
【Question 01】
When converting tweet data to a csv file, commas inside a field (e.g. location: Sydney, NSW) break the csv file by creating extra columns.
The solution is to wrap the field content in double quotation marks, like this:
fo.write("\"" + str(tweet["user"]["location"]) + "\"")
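As an alternative to adding the quotation marks by hand, Python's standard csv module can do the quoting. The following is only a minimal sketch: the sample tweet is made up, and io.StringIO stands in for the real output file.

```python
import csv
import io

# Hypothetical sample tweet, shaped like the dictionaries used above.
tweet = {"user": {"location": "Sydney, NSW"}, "text": 'He said "hi"'}

buffer = io.StringIO()       # stands in for the real csv file
writer = csv.writer(buffer)  # quotes a field only when it needs quoting
writer.writerow([tweet["user"]["location"], tweet["text"]])
csv_line = buffer.getvalue()

# Reading the line back gives the original two fields, comma included.
row = next(csv.reader(io.StringIO(csv_line)))
```

With this approach the comma in the location stays inside one column, and double quotation marks inside the text are escaped by the writer instead of being removed.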
【Question 02】
When the csv file is opened with Excel, the text sometimes shows as garbled characters, although it displays correctly in Notepad.
One solution is to open the file with Notepad++.
Another solution is to write a byte order mark (BOM) at the beginning of the file, which helps Excel detect the UTF-8 encoding, like this:
fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w", encoding="utf-8")
fo.write("\ufeff")
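Python also has the "utf-8-sig" encoding, which writes the BOM automatically when the file is opened for writing. A minimal sketch, assuming the file should be UTF-8; the path below is a hypothetical temporary file, not the real data folder.

```python
import os
import tempfile

# Hypothetical output path; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# "utf-8-sig" writes the BOM (bytes EF BB BF) before the first character.
with open(path, "w", encoding="utf-8-sig", newline="") as fo:
    fo.write('"Sydney, NSW","café"\n')

# Read the raw bytes back to confirm that the BOM is there.
with open(path, "rb") as fi:
    raw = fi.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
```

In this case no separate fo.write("\ufeff") call is needed.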
【Question 03】
Tweet text can contain line breaks, double quotation marks and single quotation marks. These characters corrupt the csv file.
So we should replace them with a space or remove them, like this:
text = str(tweet["text"])
text = text.replace("\n", " ")
text = text.replace("\"", "")
text = text.replace("\'", "")
fo.write("\"" + text + "\"")
This applies to both tweet["user"]["location"] and tweet["text"]: users can write whatever they want in these two attributes, so they are the most likely to cause errors.
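Since both attributes need the same cleaning, the replacements can be put into one small function. This is only a sketch: clean_field is a hypothetical helper name, and it also replaces "\r", which the code above does not do.

```python
def clean_field(value):
    """Return value as one quoted line without inner quotation marks."""
    text = str(value)
    # Replace Windows, old Mac and Unix line breaks with a space.
    for line_break in ("\r\n", "\r", "\n"):
        text = text.replace(line_break, " ")
    # Remove double and single quotation marks.
    text = text.replace('"', "").replace("'", "")
    # Wrap the result in double quotation marks so that commas are safe.
    return '"' + text + '"'


sample = 'Line one\r\nsaid "hello", it\'s fine'
cleaned = clean_field(sample)
```

Usage would look like fo.write(clean_field(tweet["text"])) and fo.write(clean_field(tweet["user"]["location"])).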
【Question 04】
After converting the tweets to a csv file, I could not open the file with pandas.read_csv(), which means some of the data is malformed. Since the csv file has more than 100,000 rows, how can I locate the error line?
The solution is to convert the first 10,000 rows and, if there are no errors, convert the next 10,000 rows. When an error occurs, narrow the range: if the error occurs between rows 20,000 and 30,000, try rows 20,000 to 25,000 next. Repeating this several times locates the error line and reveals the real problem. In this specific case, most problems came from content that includes line breaks, double quotation marks, etc.
The code looks like this:
...
count = 0
for line in tweets_file:
    try:
        count += 1
        # skip the rows before the range being tested
        if (count < 10000):
            continue
        ...
        # stop after the end of the range being tested
        if (count > 20000):
            break
    except:
        continue
...
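The narrowing can also be done automatically as a binary search over the lines of the finished csv file. The sketch below is not the method used above: it uses the csv module instead of pandas, treats a line as bad when its number of columns is wrong, and assumes every record sits on one physical line. The names rows_ok and find_first_bad_line and the sample lines are made up for illustration.

```python
import csv


def rows_ok(lines, expected_columns):
    """Return True if every parsed row has the expected number of columns."""
    return all(len(row) == expected_columns for row in csv.reader(lines))


def find_first_bad_line(lines, expected_columns):
    """Return the 1-based number of the first bad line, or None if all are fine."""
    if rows_ok(lines, expected_columns):
        return None
    low, high = 0, len(lines)  # the first bad line is inside lines[low:high]
    while high - low > 1:
        middle = (low + high) // 2
        if rows_ok(lines[low:middle], expected_columns):
            low = middle   # first half is clean, look in the second half
        else:
            high = middle  # first half already contains a bad line
    return low + 1


# Made-up sample: line 3 has an unquoted comma and therefore 4 columns.
sample_lines = ["a,b,c\n", "1,2,3\n", "Sydney, NSW,2,3\n", "4,5,6\n"]
bad_line = find_first_bad_line(sample_lines, 3)
```

For a real file, the lines could be read with open(...).readlines() and the expected column count taken from the header row.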